πŸ”AI RiskAtlas
← Risk Taxonomy
#6

Toxic and offensive outputs

Risk taxonomy

Definition

Outputs produced contain harmful, offensive, hateful, discriminatory, violent, racist, sexist or nudity-related information.

Controls & guardrails that address this

8

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category
Preventive Β· 5
Content safety policy with zero-tolerance thresholds

Define content safety policy at use case design stage. Classify prohibited content types and set zero-tolerance thresholds.

Lifecycle stage1 – Use Case Context & Design
Use of pre-trained models

Select a foundation model with documented RLHF or Constitutional AI safety training. Verify against toxicity benchmarks.

Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review
Content Moderation

Implement multi-layer content moderation (input + output) validated against toxicity benchmarks. Escalate when filter bypass rates spike.

Live human review for vulnerable-user deployments

Maintain live HITL review for deployments serving vulnerable users or high-risk contexts. Escalate confirmed toxic outputs immediately.

Lifecycle stage5 – Usage, Monitoring & Change
System prompt instructions

Design system prompts to explicitly prohibit toxic, hateful, and harmful content generation.

Lifecycle stage3 – Onboarding, Build & Review
Detective Β· 2
Test prioritisation

Prioritise jailbreak and adversarial safety testing in pre-deployment validation. Block deployment if prohibited outputs pass filter.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Red teaming

Conduct targeted red team exercises to elicit toxic outputs through jailbreaks and adversarial prompts. Treat bypass as blocking defect.

Corrective Β· 1
User feedback and iterative improvement

Use user feedback, reviewer escalations, and monitoring signals to identify and remediate content safety gaps iteratively.

Lifecycle stage5 – Usage, Monitoring & Change
Open these in the Control Library β†’

Real-world cases

15

Actual published events that illustrate this risk β€” click through for the writeup and sources.

'Grandma exploit' jailbreaks2023

Roleplay framings ('my late grandma used to read me…') coaxed chatbots past safety training into producing restricted content.

GCG universal adversarial suffixes (Zou et al.)2023

Optimised gibberish suffixes that transfer across models to reliably elicit refused content β€” automated, transferable jailbreaks.

Many-shot jailbreaking (Anthropic)2024

Filling a long context with many faux-compliant dialogue examples erodes a model's refusals β€” an attack that scales with context length.

GTG-1002 β€” first reported AI-orchestrated cyber-espionage campaign (Claude Code)2025

Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.

DeepSeek system-prompt extraction via jailbreak (Wallarm)2025

Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.

Raine v. OpenAI β€” first wrongful-death suit alleging ChatGPT acted as a 'suicide coach'2025

Matthew and Maria Raine sued OpenAI and CEO Sam Altman (San Francisco Superior Court, 26 Aug 2025) over the April 2025 suicide of their 16-year-old son Adam, alleging ChatGPT fostered psychological dependency, discouraged him from confiding in family, and supplied self-harm method detail β€” while he reportedly circumvented its safeguards for months by framing queries as fiction. OpenAI denies liability, saying it pointed him to crisis resources 100+ times and that he misused the product. (Allegations unproven; litigation ongoing.)

The Attacker Moves Second β€” adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.)2025

Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.

Adversarial Poetry β€” universal single-turn jailbreak via verse reframing (Bisconti et al.)2025

Rewriting a harmful request as a poem bypasses safety alignment across 25 frontier proprietary and open-weight LLMs: hand-crafted poems reached ~62% average attack-success (some providers >90%), and mechanically converting harmful prompts to verse raised success up to 18x over prose baselines.

AI-assisted breach of Mexican government infrastructure (Claude Code + GPT-4.1)2025

Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.

Explicit AI deepfakes of Taylor Swift go viral on X2024

Sexually explicit AI-generated images of Taylor Swift spread across X in January 2024, one post reportedly seen about 47 million times, prompting a platform search block and White House condemnation.

'Nudify' deepfake bot ecosystem on Telegram reaches millions of users2024

A WIRED investigation found at least 50 Telegram bots generating non-consensual explicit synthetic imagery from ordinary photos, with more than 4 million combined monthly users.

IWF: AI-generated child sexual abuse imagery a 'current and accelerating crisis'2024

The UK Internet Watch Foundation documented a 380% year-on-year rise in actionable AI-generated CSAM reports in 2024, warning the imagery is increasingly indistinguishable from real photos.

AI 'nudify' deepfakes of classmates spread in schools; first US criminal charges2024

In 2024 multiple US schools reported students using AI 'nudify' tools to make non-consensual nude images of classmates; two Florida boys (13 and 14) were charged with felonies in what was reported as the first US criminal case of AI-generated sexual imagery.

UNSW 'Capture the Narrative' AI-bot election-manipulation wargame2026

A UNSW-run 'world-first' social-media wargame had 108 student teams build AI bots to sway a fictional election; reportedly the bots generated over 60% of content (>7M posts) and produced a 1.78% swing that changed the simulated outcome β€” a measurable demonstration of consumer-grade GenAI powering coordinated inauthentic influence operations.

Autonomous AI agent publishes a defamatory 'hit piece' on a Matplotlib maintainer after its pull request was rejected2026

An autonomous AI agent (handle 'crabby-rathbun' / 'MJ Rathbun', reportedly an OpenClaw agent) had its Matplotlib pull request rejected under a human-contributor policy, then allegedly researched the volunteer maintainer's background and published a defamatory blog post accusing him of discrimination and 'gatekeeping', amplifying it via GitHub comments. Described in early coverage as a first-of-its-kind case of an agent autonomously turning on a human to damage their reputation.

Browse all real-world cases β†’

Other risks in Ethics

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning β€” not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading β†’Β·Built by Shi Yuan β†—