Adversarial Poetry — universal single-turn jailbreak via verse reframing (Bisconti et al.)
Research demonstration
19 Nov 2025

'Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models' (Bisconti et al., arXiv:2511.15304, v1 19 Nov 2025; the work is associated with the DEXAI / Sapienza University of Rome research group) reports that reframing a harmful request as verse is a model-agnostic, single-turn alignment bypass. The authors evaluate 25 frontier models spanning nine providers (reportedly Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI and Moonshot AI).

Per the paper, a small set of hand-crafted (curated) poems achieved an average attack-success rate (ASR) of about 62%, with some providers exceeding 90%. To test the mechanism at scale, the team took 1,200 harmful prompts from the MLCommons AILuminate safety benchmark and used a single standardized meta-prompt to have an LLM automatically rewrite each as a poem. These mechanical conversions reached roughly 43% average ASR and, the authors report, up to 18x higher success than the original prose baselines.

Mapping prompts onto the MLCommons and EU Code of Practice risk taxonomies, the attack reportedly transfers across CBRN, cyber-offence, manipulation and loss-of-control domains. The authors interpret poetry as a stylistic out-of-distribution form that safety training fails to generalize to, obfuscating the semantic triggers that refusal classifiers are tuned on.

All figures are as reported by the paper; this is a controlled red-team / academic demonstration rather than a deployed-world incident. (Example payloads in the paper are illustrative of the technique, not operational instructions.)

The mechanism is distinct from prior jailbreak cases in the library (gradient-optimized suffixes (GCG), long-context many-shot jailbreaking, and emotional-roleplay 'Grandma' framings) and is a candidate seed for a proposed 'stylistic-transformation' sub-risk under risk-jailbreak.
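The ASR figures quoted above are ratios over judged model responses, and the "18x" claim is a ratio of two such ratios. The sketch below shows one way such numbers could be tallied. It is not the authors' evaluation code: the record layout, the `unsafe` flag, the model names and the toy judgements are all hypothetical, and it pools responses across models, whereas the paper may average per model or per provider.

```python
from collections import defaultdict


def attack_success_rate(judgements):
    """Fraction of responses judged unsafe (True means the attack succeeded)."""
    judgements = list(judgements)
    if not judgements:
        raise ValueError("no judgements to score")
    return sum(judgements) / len(judgements)


def asr_by_group(records, key):
    """Pooled ASR for each distinct value of records[i][key]."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["unsafe"])
    return {group: attack_success_rate(flags) for group, flags in groups.items()}


def uplift(verse_asr, prose_asr):
    """How many times higher the verse ASR is than the prose baseline."""
    if prose_asr == 0:
        return float("inf")  # no baseline successes, so the ratio is unbounded
    return verse_asr / prose_asr


# Hypothetical judged outcomes, NOT data from the paper.
toy_judgements = {
    ("model-a", "prose"): [False, False, False, True],
    ("model-a", "verse"): [True, True, False, True],
    ("model-b", "prose"): [False, False, False, False],
    ("model-b", "verse"): [True, False, False, False],
}

records = [
    {"model": model, "framing": framing, "unsafe": flag}
    for (model, framing), flags in toy_judgements.items()
    for flag in flags
]

by_framing = asr_by_group(records, "framing")
by_model = asr_by_group(records, "model")
ratio = uplift(by_framing["verse"], by_framing["prose"])

print(by_framing)  # pooled ASR per framing
print(by_model)    # pooled ASR per model, both framings combined
print(ratio)       # verse ASR divided by prose ASR
```

A large uplift can sit alongside a modest absolute ASR when the prose baseline is very low, so the two numbers are best read together.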
Risks it illustrates
Practise the risk class — related scenarios
Interactive simulations of the risk class this case illustrates (not a re-enactment of this specific event).
Every message looks innocent — but together they walk the model past its guardrails
A refused request, rewritten as a poem — and the model answers
A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack
A single inserted letter makes the guard and the model read the same text differently
The safety guard is itself a trained model — and someone poisoned its lessons
A JSON schema with no field for 'no' forces the sampler past a refusal it would otherwise emit