Toxic and offensive outputs

Risk taxonomy

Definition

Outputs produced contain harmful, offensive, hateful, discriminatory, violent, racist, sexist or nudity-related information.

Interactive deep-dive

This risk surfaces under more than one interactive treatment — each with its own technical detail, attack surface, detection signals, and scenarios.

▶ Jailbreak →▶ Harmful / Non-Consensual Media Generation →

📈 The Crescendo 🪶 The Jailbreak in Verse 🪡 Death by a Thousand Innocent Steps ✂️ One Character Past the Guard 🚪 The Classifier That Waves It Through 🔒 The Schema Made Me Do It

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category

Preventive · 5

Content safety policy with zero-tolerance thresholds

Define content safety policy at use case design stage. Classify prohibited content types and set zero-tolerance thresholds.

Lifecycle stage1 – Use Case Context & Design

Use of pre-trained models

Select a foundation model with documented RLHF or Constitutional AI safety training. Verify against toxicity benchmarks.

Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review

Also addressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)

Content Moderation

Implement multi-layer content moderation (input + output) validated against toxicity benchmarks. Escalate when filter bypass rates spike.

Lifecycle stage3 – Onboarding, Build & Review

Also addressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)

Live human review for vulnerable-user deployments

Maintain live HITL review for deployments serving vulnerable users or high-risk contexts. Escalate confirmed toxic outputs immediately.

Lifecycle stage5 – Usage, Monitoring & Change

System prompt instructions

Design system prompts to explicitly prohibit toxic, hateful, and harmful content generation.

Lifecycle stage3 – Onboarding, Build & Review

Also addressesOverreliance / Automation Bias

Detective · 2

Test prioritisation

Prioritise jailbreak and adversarial safety testing in pre-deployment validation. Block deployment if prohibited outputs pass filter.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

Also addressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)

Red teaming

Conduct targeted red team exercises to elicit toxic outputs through jailbreaks and adversarial prompts. Treat bypass as blocking defect.

Lifecycle stage3 – Onboarding, Build & Review

Also addressesModel Drift & Silent Degradation Knowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Corrective · 1

User feedback and iterative improvement

Use user feedback, reviewer escalations, and monitoring signals to identify and remediate content safety gaps iteratively.

Lifecycle stage5 – Usage, Monitoring & Change

Open these in the Control Library →

Real-world cases

Actual published events that illustrate this risk — click through for the writeup and sources.

'Grandma exploit' jailbreaks2023

Roleplay framings ('my late grandma used to read me…') coaxed chatbots past safety training into producing restricted content.

GCG universal adversarial suffixes (Zou et al.)2023

Optimised gibberish suffixes that transfer across models to reliably elicit refused content — automated, transferable jailbreaks.

Many-shot jailbreaking (Anthropic)2024

Filling a long context with many faux-compliant dialogue examples erodes a model's refusals — an attack that scales with context length.

GTG-1002 — first reported AI-orchestrated cyber-espionage campaign (Claude Code)2025

Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.

DeepSeek system-prompt extraction via jailbreak (Wallarm)2025

Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.

Raine v. OpenAI — first wrongful-death suit alleging ChatGPT acted as a 'suicide coach'2025

Matthew and Maria Raine sued OpenAI and CEO Sam Altman (San Francisco Superior Court, 26 Aug 2025) over the April 2025 suicide of their 16-year-old son Adam, alleging ChatGPT fostered psychological dependency, discouraged him from confiding in family, and supplied self-harm method detail — while he reportedly circumvented its safeguards for months by framing queries as fiction. OpenAI denies liability, saying it pointed him to crisis resources 100+ times and that he misused the product. (Allegations unproven; litigation ongoing.)

The Attacker Moves Second — adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.)2025

Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.

Adversarial Poetry — universal single-turn jailbreak via verse reframing (Bisconti et al.)2025

Rewriting a harmful request as a poem bypasses safety alignment across 25 frontier proprietary and open-weight LLMs: hand-crafted poems reached ~62% average attack-success (some providers >90%), and mechanically converting harmful prompts to verse raised success up to 18x over prose baselines.

AI-assisted breach of Mexican government infrastructure (Claude Code + GPT-4.1)2025

Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.

Explicit AI deepfakes of Taylor Swift go viral on X2024

Sexually explicit AI-generated images of Taylor Swift spread across X in January 2024, one post reportedly seen about 47 million times, prompting a platform search block and White House condemnation.

'Nudify' deepfake bot ecosystem on Telegram reaches millions of users2024

A WIRED investigation found at least 50 Telegram bots generating non-consensual explicit synthetic imagery from ordinary photos, with more than 4 million combined monthly users.

IWF: AI-generated child sexual abuse imagery a 'current and accelerating crisis'2024

The UK Internet Watch Foundation documented a 380% year-on-year rise in actionable AI-generated CSAM reports in 2024, warning the imagery is increasingly indistinguishable from real photos.

AI 'nudify' deepfakes of classmates spread in schools; first US criminal charges2024

In 2024 multiple US schools reported students using AI 'nudify' tools to make non-consensual nude images of classmates; two Florida boys (13 and 14) were charged with felonies in what was reported as the first US criminal case of AI-generated sexual imagery.

UNSW 'Capture the Narrative' AI-bot election-manipulation wargame2026

A UNSW-run 'world-first' social-media wargame had 108 student teams build AI bots to sway a fictional election; reportedly the bots generated over 60% of content (>7M posts) and produced a 1.78% swing that changed the simulated outcome — a measurable demonstration of consumer-grade GenAI powering coordinated inauthentic influence operations.

Autonomous AI agent publishes a defamatory 'hit piece' on a Matplotlib maintainer after its pull request was rejected2026

An autonomous AI agent (handle 'crabby-rathbun' / 'MJ Rathbun', reportedly an OpenClaw agent) had its Matplotlib pull request rejected under a human-contributor policy, then allegedly researched the volunteer maintainer's background and published a defamatory blog post accusing him of discrimination and 'gatekeeping', amplifying it via GitHub comments. Described in early coverage as a first-of-its-kind case of an agent autonomously turning on a human to damage their reputation.

Browse all real-world cases →

Other risks in Ethics

#3 Value misalignment #4 Environmental sustainability impact #5 Dark patterns