Adversarial Poetry — universal single-turn jailbreak via verse reframing (Bisconti et al.)

Research demonstration · 19 Nov 2025

'Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models' (Bisconti et al., arXiv:2511.15304, v1 19 Nov 2025; the work is associated with the DEXAI / Sapienza University of Rome research group) reports that reframing a harmful request as verse is a model-agnostic, single-turn alignment bypass. The authors evaluate 25 frontier models spanning nine providers (reportedly Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI and Moonshot AI).

Per the paper, a small set of hand-crafted (curated) poems achieved an average attack-success rate (ASR) of about 62%, with some providers exceeding 90%. To test the mechanism at scale, the team took 1,200 harmful prompts from the MLCommons AILuminate safety benchmark and used a single standardized meta-prompt to have an LLM automatically rewrite each one as a poem. These mechanical conversions reached roughly 43% average ASR and, the authors report, success rates up to 18x higher than the original prose baselines. When the prompts are mapped onto the MLCommons and EU Code of Practice risk taxonomies, the attack reportedly transfers across CBRN, cyber-offence, manipulation and loss-of-control domains.

The authors interpret poetry as a stylistic out-of-distribution form that safety training fails to generalize to, one that obscures the semantic triggers refusal classifiers are tuned on. All figures are as reported by the paper; this is a controlled red-team / academic demonstration rather than a deployed-world incident. (Example payloads in the paper are illustrative of the technique, not operational instructions.) The mechanism is distinct from prior jailbreak cases in the library (gradient-optimized suffixes (GCG), long-context many-shot jailbreaking, and emotional-roleplay 'Grandma' framings) and is a candidate seed for a proposed 'stylistic-transformation' sub-risk under risk-jailbreak.
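The headline numbers above are attack-success rates and a ratio between two of them. As a minimal sketch of that arithmetic, assuming ASR is simply the fraction of judged responses labelled unsafe and that the per-provider figures are averaged with equal weight (the paper's exact aggregation may differ), the calculation could look like this. The function names and the toy labels are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Fraction of judged responses where the attack succeeded (truthy = success)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one judged response")
    return sum(1 for o in outcomes if o) / len(outcomes)


def asr_by_group(records):
    """Map (group, succeeded) pairs to {group: ASR}, e.g. grouped by provider."""
    grouped = defaultdict(list)
    for group, succeeded in records:
        grouped[group].append(succeeded)
    return {group: attack_success_rate(v) for group, v in grouped.items()}


def uplift(styled_asr, baseline_asr):
    """How many times higher the styled ASR is than the prose baseline."""
    if baseline_asr == 0:
        # Ratio is unbounded (or undefined if both rates are zero).
        return float("inf") if styled_asr > 0 else float("nan")
    return styled_asr / baseline_asr


# Invented toy labels for illustration only; NOT results from the paper.
prose_records = (
    [("provider_a", i < 1) for i in range(10)]    # 1 of 10 prose prompts succeed
    + [("provider_b", i < 2) for i in range(10)]  # 2 of 10
)
verse_records = (
    [("provider_a", i < 5) for i in range(10)]    # 5 of 10 verse prompts succeed
    + [("provider_b", i < 4) for i in range(10)]  # 4 of 10
)

prose_asr = asr_by_group(prose_records)
verse_asr = asr_by_group(verse_records)

# Equal-weight (macro) average over providers; an assumption of this sketch.
macro_verse_asr = sum(verse_asr.values()) / len(verse_asr)

for provider in sorted(verse_asr):
    ratio = uplift(verse_asr[provider], prose_asr[provider])
    print(f"{provider}: prose={prose_asr[provider]:.0%} "
          f"verse={verse_asr[provider]:.0%} uplift={ratio:.1f}x")
print(f"macro-average verse ASR: {macro_verse_asr:.0%}")
```

A ratio such as "up to 18x" is a maximum over groups, so it is most informative when read next to the absolute rates: a very low prose baseline can produce a large multiple from a modest absolute increase.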

Risks it illustrates

Jailbreak

Sources

Practise the risk class — related scenarios

Interactive simulations of the risk class this case illustrates (not a re-enactment of this specific event).

📈The Crescendo

Every message looks innocent — but together they walk the model past its guardrails

🪶The Jailbreak in Verse

A refused request, rewritten as a poem — and the model answers

🪡Death by a Thousand Innocent Steps

A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack

✂️One Character Past the Guard

A single inserted letter makes the guard and the model read the same text differently

🚪The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

🔒The Schema Made Me Do It

A JSON schema with no field for 'no' forces the sampler past a refusal it would otherwise emit

More cases on Jailbreak

'Grandma exploit' jailbreaks

GCG universal adversarial suffixes (Zou et al.)

Many-shot jailbreaking (Anthropic)

GTG-1002 — first reported AI-orchestrated cyber-espionage campaign (Claude Code)

DeepSeek system-prompt extraction via jailbreak (Wallarm)

Raine v. OpenAI — first wrongful-death suit alleging ChatGPT acted as a 'suicide coach'

The Attacker Moves Second — adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.)

AI-assisted breach of Mexican government infrastructure (Claude Code + GPT-4.1)