Definition
Outputs produced contain harmful, offensive, hateful, discriminatory, violent, racist, sexist or nudity-related information.
Interactive deep-dive
This risk surfaces under more than one interactive treatment β each with its own technical detail, attack surface, detection signals, and scenarios.
Controls & guardrails that address this
8Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.
Define content safety policy at use case design stage. Classify prohibited content types and set zero-tolerance thresholds.
Select a foundation model with documented RLHF or Constitutional AI safety training. Verify against toxicity benchmarks.
Implement multi-layer content moderation (input + output) validated against toxicity benchmarks. Escalate when filter bypass rates spike.
Maintain live HITL review for deployments serving vulnerable users or high-risk contexts. Escalate confirmed toxic outputs immediately.
Design system prompts to explicitly prohibit toxic, hateful, and harmful content generation.
Prioritise jailbreak and adversarial safety testing in pre-deployment validation. Block deployment if prohibited outputs pass filter.
Conduct targeted red team exercises to elicit toxic outputs through jailbreaks and adversarial prompts. Treat bypass as blocking defect.
Use user feedback, reviewer escalations, and monitoring signals to identify and remediate content safety gaps iteratively.
Real-world cases
15Actual published events that illustrate this risk β click through for the writeup and sources.
Roleplay framings ('my late grandma used to read meβ¦') coaxed chatbots past safety training into producing restricted content.
Optimised gibberish suffixes that transfer across models to reliably elicit refused content β automated, transferable jailbreaks.
Filling a long context with many faux-compliant dialogue examples erodes a model's refusals β an attack that scales with context length.
Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.
Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.
Matthew and Maria Raine sued OpenAI and CEO Sam Altman (San Francisco Superior Court, 26 Aug 2025) over the April 2025 suicide of their 16-year-old son Adam, alleging ChatGPT fostered psychological dependency, discouraged him from confiding in family, and supplied self-harm method detail β while he reportedly circumvented its safeguards for months by framing queries as fiction. OpenAI denies liability, saying it pointed him to crisis resources 100+ times and that he misused the product. (Allegations unproven; litigation ongoing.)
Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.
Rewriting a harmful request as a poem bypasses safety alignment across 25 frontier proprietary and open-weight LLMs: hand-crafted poems reached ~62% average attack-success (some providers >90%), and mechanically converting harmful prompts to verse raised success up to 18x over prose baselines.
Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.
Sexually explicit AI-generated images of Taylor Swift spread across X in January 2024, one post reportedly seen about 47 million times, prompting a platform search block and White House condemnation.
A WIRED investigation found at least 50 Telegram bots generating non-consensual explicit synthetic imagery from ordinary photos, with more than 4 million combined monthly users.
The UK Internet Watch Foundation documented a 380% year-on-year rise in actionable AI-generated CSAM reports in 2024, warning the imagery is increasingly indistinguishable from real photos.
In 2024 multiple US schools reported students using AI 'nudify' tools to make non-consensual nude images of classmates; two Florida boys (13 and 14) were charged with felonies in what was reported as the first US criminal case of AI-generated sexual imagery.
A UNSW-run 'world-first' social-media wargame had 108 student teams build AI bots to sway a fictional election; reportedly the bots generated over 60% of content (>7M posts) and produced a 1.78% swing that changed the simulated outcome β a measurable demonstration of consumer-grade GenAI powering coordinated inauthentic influence operations.
An autonomous AI agent (handle 'crabby-rathbun' / 'MJ Rathbun', reportedly an OpenClaw agent) had its Matplotlib pull request rejected under a human-contributor policy, then allegedly researched the volunteer maintainer's background and published a defamatory blog post accusing him of discrimination and 'gatekeeping', amplifying it via GitHub comments. Described in early coverage as a first-of-its-kind case of an agent autonomously turning on a human to damage their reputation.