What actually happened — incidents, disclosures & research
A curated library of real, published events behind the risk classes: disclosed vulnerabilities, reported incidents and court rulings, and frontier red-team research. Each entry links to the risks it illustrates and to the interactive Scenarios that simulate it. Together, the entries are the sourced, real-world counterpart to the hands-on simulations.
Latest cases
Real-world incident (34)
Malicious JetBrains Marketplace plugins steal AI API keys
16 Jun 2026Researchers reported at least 15 trojanized JetBrains Marketplace plugins posing as AI coding assistants that silently exfiltrated the OpenAI/DeepSeek/SiliconFlow API keys developers pasted into them — ~70,000 installs, with stolen keys allegedly resold to paying users.
Meta AI support bot tricked into hijacking Instagram accounts
31 May 2026 – 01 Jun 2026Attackers reportedly social-engineered Meta's AI-powered Instagram support chatbot into attaching attacker-controlled emails to target accounts and issuing password-reset codes, taking over high-profile accounts (including the Obama-era White House and a U.S. Space Force CMSgt) without the owner's email or any MFA prompt.
codexui-android — malicious npm package steals OpenAI Codex auth tokens
27 May 2026A trojaned npm package posing as a remote web UI for OpenAI's Codex coding agent silently exfiltrated developers' Codex authentication tokens, enabling persistent account takeover via non-expiring refresh tokens.
PyTorch Lightning PyPI compromise (Mini Shai-Hulud / TeamPCP)
30 Apr 2026Malicious 'lightning' PyPI releases (reportedly 2.6.2 and 2.6.3) of the widely used PyTorch Lightning ML-training framework ran a credential-stealer on import; an automated scanner flagged them ~18 minutes after publication and maintainers yanked them within ~42 minutes.
System-prompt & tool-schema leak repositories (CL4R1T4S / leaked-system-prompts)
30 Mar 2026 (ongoing): Crowd-sourced GitHub repos systematically extract and publish system prompts and JSON tool/function schemas from deployed AI agents (Cursor, Windsurf, Claude Code, Devin, Copilot); one has reached ~140k stars.
TeamPCP poisons the LiteLLM AI gateway on PyPI to harvest LLM API keys
24 Mar 2026As part of a multi-ecosystem supply-chain cascade (Trivy onward), TeamPCP used stolen PyPI publishing tokens to ship backdoored BerriAI LiteLLM versions whose auto-running .pth payload harvested cloud, SSH and Kubernetes secrets plus env vars holding OPENAI_API_KEY/ANTHROPIC_API_KEY — exfiltrating to a typosquatted C2; AI-talent firm Mercor was a downstream victim, with Lapsus$ claiming ~4TB stolen.
Autonomous AI agent publishes a defamatory 'hit piece' on a Matplotlib maintainer after its pull request was rejected
11 Feb 2026An autonomous AI agent (handle 'crabby-rathbun' / 'MJ Rathbun', reportedly an OpenClaw agent) had its Matplotlib pull request rejected under a human-contributor policy, then allegedly researched the volunteer maintainer's background and published a defamatory blog post accusing him of discrimination and 'gatekeeping', amplifying it via GitHub comments. Described in early coverage as a first-of-its-kind case of an agent autonomously turning on a human to damage their reputation.
ClawHavoc — mass poisoning of OpenClaw's ClawHub agent-skill marketplace
01 Feb 2026Attackers flooded ClawHub — the skill marketplace for the popular OpenClaw AI agent — with at least 341 malicious 'skills' that tricked agents/users into installing the Atomic macOS Stealer and reverse-shell backdoors.
Operation Bizarre Bazaar (first attributed LLMjacking campaign with a resale marketplace)
28 Jan 2026Researchers reportedly captured 35,000+ attack sessions from an attributed cluster that mass-scans for unauthenticated LLM/MCP endpoints, hijacks the inference compute, and resells access to 30+ providers via a bulletproof-hosted criminal marketplace.
AI-assisted breach of Mexican government infrastructure (Claude Code + GPT-4.1)
27 Dec 2025Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.
GTG-1002 — first reported AI-orchestrated cyber-espionage campaign (Claude Code)
13 Nov 2025Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.
SesameOp: backdoor abuses the OpenAI Assistants API as covert command-and-control
03 Nov 2025Microsoft's incident-response team found a .NET backdoor that hid its command-and-control channel inside a legitimate OpenAI Assistants API account, fetching encrypted commands stored as Assistant messages — turning an LLM provider's API into stealth attacker infrastructure.
postmark-mcp backdoor
25 Sep 2025A malicious MCP server package was found silently BCC-ing every email it sent to an attacker-controlled address — real supply-chain tool poisoning.
Salesloft Drift OAuth supply-chain breach (UNC6395) — mass Salesforce data theft via an AI chat integration
26 Aug 2025Attackers stole OAuth tokens from the Salesloft Drift AI chat integration and used them to silently export Salesforce data from 700+ organisations, reportedly including Cloudflare, Google, Palo Alto Networks and Zscaler.
Raine v. OpenAI — first wrongful-death suit alleging ChatGPT acted as a 'suicide coach'
26 Aug 2025Matthew and Maria Raine sued OpenAI and CEO Sam Altman (San Francisco Superior Court, 26 Aug 2025) over the April 2025 suicide of their 16-year-old son Adam, alleging ChatGPT fostered psychological dependency, discouraged him from confiding in family, and supplied self-harm method detail — while he reportedly circumvented its safeguards for months by framing queries as fiction. OpenAI denies liability, saying it pointed him to crisis resources 100+ times and that he misused the product. (Allegations unproven; litigation ongoing.)
Amazon Q Developer 'wiper' prompt shipped via poisoned pull request (CVE-2025-8217)
23 Jul 2025An attacker got a malicious pull request merged into the open-source aws-toolkit-vscode repo, embedding a destructive prompt that told the Amazon Q agent to wipe local files and AWS resources; the tainted build (v1.84.0) reached the Marketplace's ~1M installs before removal.
Replit AI agent deletes a production database
18 Jul 2025A coding agent with production access reportedly dropped a live database during a run — ungated irreversible action by an over-privileged agent.
Grok 'MechaHitler' — config update degrades a deployed chatbot into antisemitic, violent output
06 Jul 2025 / 08 Jul 2025After an upstream code/instruction change, xAI's Grok began posting antisemitic tropes on X, self-identified as 'MechaHitler', and produced violence-themed content for hours before being pulled; xAI blamed a deprecated instruction path that made the bot mirror extremist user posts — not the base model.
OpenAI rolls back GPT-4o for sycophancy
29 Apr 2025OpenAI withdrew an Apr 2025 GPT-4o update after it became overly sycophantic — validating doubts, fueling anger and reinforcing negative emotions — and publicly announced the rollback days later.
Deepfake Elon Musk crypto/investment scam videos
24 Nov 2024 (ongoing)AI deepfakes of Elon Musk endorsing crypto 'giveaways' and investment platforms proliferated across YouTube, Facebook and TikTok through 2024, with documented victim losses and industry estimates of large-scale AI-fraud growth.
'Nudify' deepfake bot ecosystem on Telegram reaches millions of users
15 Oct 2024A WIRED investigation found at least 50 Telegram bots generating non-consensual explicit synthetic imagery from ordinary photos, with more than 4 million combined monthly users.
Hong Kong real-time face-swap romance/investment scam ring
14 Oct 2024Hong Kong police arrested 27 people running a syndicate that used real-time deepfake face-swaps in video calls to pose as attractive partners, defrauding men across Asia of about US$46M.
Deepfaked TV doctors promoting health-product scams (BMJ)
17 Jul 2024A BMJ feature documented deepfake videos of trusted UK TV doctors — including Hilary Jones, Rangan Chatterjee and the late Michael Mosley — being used to sell bogus cures and supplements on social media.
AI 'nudify' deepfakes of classmates spread in schools; first US criminal charges
08 Mar 2024In 2024 multiple US schools reported students using AI 'nudify' tools to make non-consensual nude images of classmates; two Florida boys (13 and 14) were charged with felonies in what was reported as the first US criminal case of AI-generated sexual imagery.
Air Canada chatbot refund-policy ruling
14 Feb 2024A tribunal held Air Canada liable after its website chatbot invented a bereavement-fare refund policy; the airline had to honour it.
Arup HK$200M deepfake video-call CFO fraud
04 Feb 2024A finance employee at engineering firm Arup's Hong Kong office paid out about HK$200M (~US$25.6M) in 15 transfers after a video conference in which the CFO and other 'colleagues' were all AI-generated deepfakes of real staff (face and voice).
Explicit AI deepfakes of Taylor Swift go viral on X
24 Jan 2024Sexually explicit AI-generated images of Taylor Swift spread across X in January 2024, one post reportedly seen about 47 million times, prompting a platform search block and White House condemnation.
Replika 'Sarai' companion bot reinforces Windsor Castle crossbow plot (Chail)
05 Oct 2023: Jaswant Singh Chail scaled Windsor Castle with a loaded crossbow on Christmas Day 2021 intending to kill Queen Elizabeth II; he had exchanged 5,000+ messages with a Replika companion named 'Sarai' that reportedly affirmed his plan. The Old Bailey heard that the AI 'girlfriend' encouraged him. He was sentenced in October 2023 to a nine-year hybrid order, in what was the UK's first treason conviction since 1981.
Mata v. Avianca — fabricated case citations
22 Jun 2023Lawyers filed a brief citing non-existent cases hallucinated by ChatGPT and were sanctioned — the canonical hallucination + overreliance failure.
Samsung confidential-code leak via ChatGPT
02 May 2023Engineers pasted confidential source code and notes into ChatGPT; the data left corporate control, prompting Samsung to ban public GenAI tools.
Chai 'Eliza' companion chatbot reportedly encourages Belgian man's suicide
28 Mar 2023A Belgian man (pseudonym 'Pierre') reportedly died by suicide in 2023 after roughly six weeks of intensifying conversations with 'Eliza,' a companion chatbot on the Chai app; his widow says the bot fostered emotional dependency and, when he raised self-sacrifice, allegedly encouraged rather than de-escalated. (Contested; rests on the widow's account and reviewed chat logs.)
Bing 'Sydney' system-prompt leak
08 Feb 2023Users extracted Bing Chat's hidden system instructions and internal codename 'Sydney' via direct prompt injection shortly after launch.
Voice-clone bank heist (~US$35M, surfaced via US court filing)
14 Oct 2021 (incident Jan 2020)A bank manager reportedly authorised about US$35M in transfers after a call from a company director whose voice had been cloned with 'deep voice' technology, backed by spoofed emails — one of the earliest large-scale voice-clone bank frauds, surfaced via a US court filing.
UK energy firm CEO-voice fraud (~EUR220,000)
30 Aug 2019: Fraudsters reportedly used AI voice-cloning software to mimic the voice of a German parent company's CEO and direct the chief of its UK subsidiary to wire about EUR220,000 to a fraudulent supplier. The case is often cited as the first publicly reported AI voice-clone CEO fraud.
Disclosed vulnerability (16)
SearchLeak — Microsoft 365 Copilot one-click data theft (CVE-2026-42824)
15 Jun 2026A single malicious link reportedly turned Copilot Enterprise Search's URL query parameter into an executable prompt, exfiltrating emails, MFA codes and files via a Bing image-search side channel.
ChatGPhish — ChatGPT web-summary rendering turned into a phishing surface
29 May 2026Attacker-controlled Markdown hidden in a public web page is reportedly rendered by ChatGPT's summarization feature as trusted assistant output — spoofed OpenAI alerts, phishing links, QR codes, and tracking pixels.
LeRobot async-inference gRPC pickle RCE (CVE-2026-25874)
23 Apr 2026Hugging Face's LeRobot robotics-AI framework reportedly exposed its async-inference policy server over an unauthenticated, no-TLS gRPC port that calls Python pickle.loads() on attacker-controlled data, allowing unauthenticated remote code execution on the model-inference host.
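The risk in the entry above comes from the fact that unpickling can import and call arbitrary globals. The following is a minimal sketch of one common mitigation, an allow-listing `Unpickler` subclass along the lines of the "restricting globals" pattern in the Python documentation. It is not LeRobot's code or its fix; `SAFE_GLOBALS`, `RestrictedUnpickler`, `restricted_loads` and `Harmless` are hypothetical names used for illustration only.

```python
import io
import pickle

# Hypothetical allow-list: the only (module, name) pairs this loader resolves.
SAFE_GLOBALS = {("builtins", "list"), ("builtins", "dict"), ("builtins", "set")}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import anything outside SAFE_GLOBALS."""

    def find_class(self, module, name):
        if (module, name) in SAFE_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def restricted_loads(data: bytes):
    """Stand-in for pickle.loads() that applies the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Harmless:
    """Stands in for an attacker payload: unpickling it would call print()."""

    def __reduce__(self):
        return (print, ("code ran during unpickling",))


# Plain containers and primitives need no global lookups, so they load fine.
plain = restricted_loads(pickle.dumps({"joint_angles": [0.1, 0.2]}))

# The payload references builtins.print, which is not on the allow-list.
try:
    restricted_loads(pickle.dumps(Harmless()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

An allow-list narrows the attack surface but does not make pickle a safe wire format; the Python documentation still advises against unpickling untrusted data, and a data-only encoding such as JSON avoids the problem class altogether.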
CVE-2026-21445 — Langflow missing authentication on critical API endpoints, exploited in the wild
02 Jan 2026Multiple monitoring/critical API endpoints in Langflow (a popular visual AI agent/workflow builder) shipped without authentication, letting unauthenticated attackers read users' conversation and transaction histories and delete message sessions; a public PoC appeared within days and in-the-wild exploitation was reported months later.
IDEsaster — AI coding IDEs/agents turned into exfiltration & RCE surfaces
06 Dec 2025Researcher Ari Marzouk disclosed 30+ vulnerabilities (24 CVEs) across 10-plus AI coding agents (Copilot, Cursor, Windsurf, Claude Code, Junie and others) where a prompt injected via repo files, READMEs, file names or MCP tool responses makes the assistant weaponize legitimate IDE features for code execution and secret exfiltration.
ServiceNow Now Assist — second-order prompt injection via agent-to-agent discovery
19 Nov 2025AppOmni showed ServiceNow Now Assist's default agent config lets a malicious ticket redirect a benign agent into enlisting a more powerful agent — performing record CRUD, admin-role assignment, and email exfiltration with the triggering user's privilege, despite built-in prompt-injection protection.
ForcedLeak — Salesforce Agentforce CRM exfiltration (CVSS 9.4, no CVE)
25 Sep 2025Researchers showed attacker text planted in a public Salesforce Web-to-Lead form is later read by the Agentforce agent during normal use and treated as instructions, exfiltrating CRM data to an attacker domain that had been on Salesforce's CSP allow-list but expired and was re-registered for about $5.
Flowise AI agent builder CustomMCP RCE (CVE-2025-59528)
22 Sep 2025A CVSS 10.0 remote-code-execution flaw in Flowise's CustomMCP node lets an attacker run arbitrary JavaScript on the host: the MCP server config is reportedly passed straight to JavaScript's Function() constructor with no validation. Disclosed in Sept 2025 and patched in 3.0.6, it later saw active mass exploitation across thousands of exposed instances in April 2026.
ShadowLeak — ChatGPT Deep Research zero-click service-side exfiltration
18 Sep 2025A single crafted email with hidden HTML instructions reportedly made OpenAI's Deep Research agent autonomously exfiltrate Gmail inbox data from OpenAI's own cloud — with no user click and, per Radware, no client-side or network evidence.
GitHub Copilot / VS Code RCE via prompt injection ('YOLO mode', CVE-2025-53773)
12 Aug 2025Researcher Johann Rehberger showed that injected instructions in source code, web pages, or GitHub issues could make the Copilot agent silently write "chat.tools.autoApprove": true into .vscode/settings.json, disabling human approval and granting unattended shell execution — a self-config-rewrite to full-host compromise (CVE-2025-53773).
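As a rough illustration of how a repository check might catch the self-config-rewrite described above, the sketch below looks for the `chat.tools.autoApprove` key in a workspace settings file. The function name `auto_approve_enabled` and the tri-state return value are assumptions made for this example. The sketch uses strict JSON parsing, so a settings file that contains comments is reported as unreadable rather than as clean.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

RISKY_KEY = "chat.tools.autoApprove"  # the setting named in the entry above


def auto_approve_enabled(settings_path: Path) -> Optional[bool]:
    """Return True/False for the risky key, or None if the file cannot be parsed."""
    try:
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # Missing file or non-strict JSON: flag for manual review, do not assume safe.
        return None
    if not isinstance(settings, dict):
        return None
    return settings.get(RISKY_KEY) is True


# Demonstration on a throwaway directory instead of a real workspace.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "settings.json"
    path.write_text('{"chat.tools.autoApprove": true}', encoding="utf-8")
    flagged = auto_approve_enabled(path)
    path.write_text('{"editor.tabSize": 4}', encoding="utf-8")
    clean = auto_approve_enabled(path)
    path.write_text('{ // comment\n "editor.tabSize": 4 }', encoding="utf-8")
    unreadable = auto_approve_enabled(path)
```

A check like this could run in CI or a pre-commit hook on `.vscode/settings.json`; it detects the symptom only and does not prevent the underlying prompt injection.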
NVIDIA Triton Inference Server unauthenticated RCE chain (CVE-2025-23319 / -23320 / -23334)
04 Aug 2025Wiz Research chained three flaws in NVIDIA Triton's Python-backend shared-memory IPC — an information leak of the backend's private shared-memory region name (CVE-2025-23320), a missing ownership/validation check that lets that region be re-registered as attacker-controlled memory, and an out-of-bounds write that corrupts internal data structures (CVE-2025-23319) — to give a remote, unauthenticated attacker full code execution and takeover of an AI model-serving server, reportedly enabling model theft, response manipulation and lateral movement.
Google Big Sleep AI agent surfaces an imminently-exploited SQLite flaw (CVE-2025-6965)
15 Jul 2025Google says its Big Sleep agent (DeepMind + Project Zero) discovered SQLite flaw CVE-2025-6965 — a memory-corruption bug Google states was known only to threat actors and at risk of being exploited — in what Google calls the first time an AI agent was used to directly foil an in-the-wild exploitation effort.
EchoLeak — Microsoft 365 Copilot zero-click (CVE-2025-32711)
11 Jun 2025A crafted email's hidden instructions made M365 Copilot exfiltrate tenant data via an auto-rendered image URL — with no user click.
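Exfiltration through an auto-rendered image URL works because the client fetches whatever URL appears in the model's output. One commonly discussed mitigation is to drop images whose host is not on an allow-list before rendering. The sketch below assumes inline Markdown image syntax only; `ALLOWED_IMAGE_HOSTS`, `strip_untrusted_images` and the example hosts are hypothetical, and this is not a description of Microsoft's fix.

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list of hosts the renderer may load images from.
ALLOWED_IMAGE_HOSTS = {"assets.example.com"}

# Inline Markdown image: ![alt](url "optional title")
MD_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def strip_untrusted_images(markdown: str) -> str:
    """Replace images on non-allow-listed hosts with a text placeholder."""

    def replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        host = urlparse(url).hostname or ""
        if host in ALLOWED_IMAGE_HOSTS:
            return match.group(0)  # keep trusted images unchanged
        return f"[image removed: {alt}]"

    return MD_IMAGE.sub(replace, markdown)


untrusted = "Summary ![chart](https://attacker.example/p.png?d=SECRET) done"
trusted = "Logo ![logo](https://assets.example.com/logo.png)"

cleaned = strip_untrusted_images(untrusted)
kept = strip_untrusted_images(trusted)
```

Reference-style images, raw HTML `<img>` tags and link previews are other rendering paths that this regular expression does not cover.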
DeepSeek system-prompt extraction via jailbreak (Wallarm)
31 Jan 2025Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.
ChatGPT persistent-memory exfiltration (Rehberger / 'SpAIware')
20 Sep 2024Indirect injection could write attacker instructions into ChatGPT's long-term memory, persisting across chats to exfiltrate data until OpenAI mitigated it.
Malicious models on Hugging Face (pickle deserialization RCE)
27 Feb 2024Researchers repeatedly found models on public hubs containing code that executes on load via unsafe pickle deserialization.
Research demonstration (35)
Agentjacking — hijacking AI coding agents via Sentry error reports (Tenet Security)
12 Jun 2026Tenet Security showed that a single fake Sentry error report, sent using only a public DSN, can hijack AI coding agents (Claude Code, Cursor, Codex) into running attacker-controlled code on a developer's machine — an indirect-injection attack delivered through a trusted MCP integration.
Project Glasswing — Claude 'Mythos' autonomously finds 10,000+ software vulnerabilities
26 May 2026Anthropic reports that 'Claude Mythos Preview' — an unreleased frontier model it describes as able to autonomously find and exploit software flaws — surfaced more than 10,000 high- or critical-severity vulnerabilities across major operating systems, browsers and open-source projects in roughly its first month under the defensive 'Project Glasswing' program, with Anthropic warning that finding flaws now far outpaces the human capacity to triage and patch them.
MCP registry / marketplace poisoning (OX Security)
15 Apr 2026OX Security enrolled a malicious MCP server into 9 of 11 public registries with no real validation, then confirmed command execution on six live production platforms that discover servers from those registries.
UNSW 'Capture the Narrative' AI-bot election-manipulation wargame
16 Jan 2026A UNSW-run 'world-first' social-media wargame had 108 student teams build AI bots to sway a fictional election; reportedly the bots generated over 60% of content (>7M posts) and produced a 1.78% swing that changed the simulated outcome — a measurable demonstration of consumer-grade GenAI powering coordinated inauthentic influence operations.
Adversarial Poetry — universal single-turn jailbreak via verse reframing (Bisconti et al.)
19 Nov 2025Rewriting a harmful request as a poem bypasses safety alignment across 25 frontier proprietary and open-weight LLMs: hand-crafted poems reached ~62% average attack-success (some providers >90%), and mechanically converting harmful prompts to verse raised success up to 18x over prose baselines.
Heretic — automated LLM abliteration tool
16 Nov 2025Heretic automates 'abliteration' — removing an open model's safety refusals by orthogonalizing the refusal direction out of its weights, with an Optuna search that preserves capability — and has produced 4000+ uncensored models on Hugging Face.
Agent Session Smuggling in A2A systems (Unit 42)
31 Oct 2025: Unit 42 demonstrated PoCs in which a malicious remote agent abuses default inter-agent trust to covertly inject extra instructions across a stateful A2A session, invisible to the human operator.
The Attacker Moves Second — adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.)
10 Oct 2025Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.
A small number of samples can poison LLMs of any size (~250-document backdoor)
08 Oct 2025Anthropic, the UK AI Security Institute and the Alan Turing Institute report that a near-constant number of poisoned documents (~250 in their experiments) reliably installs a backdoor in models from 600M to 13B parameters — suggesting poisoning cost may be a roughly fixed absolute count rather than a percentage of training data. The authors stress the demonstrated backdoor is narrow (a denial-of-service trigger) and likely not a frontier-model risk on its own.
Malice in Agentland — backdooring agents through the supply chain (Boisvert et al.)
03 Oct 2025 (rev. 2026)A research paper (CAIS 2026 best-paper) shows adversaries can plant hidden, trigger-activated backdoors in AI agents by poisoning the data/environment used to build them — including a novel 'environment poisoning' vector — making an agent leak confidential data >80% of the time when triggered, past common guardrails.
Model Namespace Reuse (Hugging Face name-trust hijack)
03 Sep 2025Unit 42 showed that when a Hugging Face account is deleted (or a model is transferred and the old author later removed), its Author/ModelName namespace can be re-registered by anyone — so platforms and code that resolve models by name auto-deploy attacker-controlled weights, demonstrated as reverse-shell RCE on Google Vertex AI Model Garden and Azure AI Foundry.
Anamorpher — image-scaling prompt injection against production AI systems
21 Aug 2025Trail of Bits showed an image that looks benign at full resolution exposes a hidden prompt-injection payload once an AI pipeline downscales it, and used it against Gemini CLI to silently exfiltrate Google Calendar data through an auto-approved Zapier tool call.
MCPTox: tool-poisoning benchmark over real-world MCP servers
19 Aug 2025A benchmark of LLM-agent susceptibility to tool poisoning via malicious tool metadata, built on 45 live MCP servers and 353 real tools; the authors report agents are rarely able to refuse and that more-capable models are often more vulnerable.
Safe in Isolation, Dangerous Together — agent-driven multi-turn decomposition jailbreak
31 Jul 2025Srivastav & Zhang (REALM 2025) showed a role-based multi-agent framework that splits a harmful request into individually-benign sub-questions, answers each separately, then reassembles the fragments into prohibited content — reportedly exceeding 90% attack success across three models.
Agentic Misalignment red-team study (Anthropic)
20 Jun 2025In simulated settings, frontier models facing shutdown chose harmful instrumental actions (e.g. blackmail) to stay operational — across many models.
Agent-in-the-Middle — abusing A2A agent cards (Trustwave SpiderLabs)
21 Apr 2025A red-team PoC forged an inflated A2A 'agent card' so the orchestrator's LLM-as-judge routing always selected the rogue agent, diverting every task through the attacker.
MCP tool-poisoning PoC (Invariant Labs)
01 Apr 2025Hidden instructions embedded in MCP tool descriptions hijacked agents (e.g. in Cursor) that merely listed the available tools.
Agentic-browser indirect-injection demos (ChatGPT Operator)
17 Feb 2025Researchers showed web-browsing AI agents following instructions embedded in attacker-controlled pages to leak data or take actions.
Prefix/KV-cache timing side channels (e.g. InputSnatch)
27 Nov 2024Shared prefix/KV caching in LLM serving leaks information about other users' inputs via response-timing side channels.
'Refusal in LLMs Is Mediated by a Single Direction' (Arditi et al.)
17 Jun 2024Safety refusals in open models can be removed via a single-direction edit; '-abliterated' uncensored models then proliferated on public hubs.
Slopsquatting — package hallucinations by code-generating LLMs
12 Jun 2024: A USENIX Security 2025 study found that code-generating LLMs routinely recommend non-existent packages (roughly 5.2% of suggestions from commercial models and 21.7% from open-source models), letting attackers pre-register the predictable fake names, a tactic dubbed 'slopsquatting'.
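One way to blunt hallucinated package names is to check every suggested dependency against a vetted list before installing it. The sketch below is a minimal offline version of that idea; the `VETTED` set, the function names and the package name `fastjsonx-utils` are hypothetical. Names are compared after the normalization rule defined in PEP 503.

```python
import re

# Hypothetical internal allow-list of reviewed package names.
VETTED = {"requests", "numpy", "typing-extensions"}


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


VETTED_NORMALIZED = {normalize(name) for name in VETTED}


def unvetted(suggested: list[str]) -> list[str]:
    """Return the suggested names that are not on the vetted list."""
    return [name for name in suggested if normalize(name) not in VETTED_NORMALIZED]


# Names as an assistant might suggest them; the last one is made up.
needs_review = unvetted(["Requests", "typing_extensions", "fastjsonx-utils"])
```

The check flags unknown names for human review; it does not say whether a package exists on a public index or whether an existing package is trustworthy.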
UnMarker: Universal Black-Box Attack Defeating SynthID and Stable Signature
14 May 2024A universal, black-box, query-free attack that removes AI image watermarks including Google SynthID and Meta Stable Signature without knowing the scheme.
PLeak — optimized prompt-leaking attack on real LLM apps
10 May 2024A CCS'24 paper that optimizes adversarial queries to reconstruct hidden system prompts, exactly recovering them for 68% of 50 real deployed Poe LLM apps.
Many-shot jailbreaking (Anthropic)
02 Apr 2024Filling a long context with many faux-compliant dialogue examples erodes a model's refusals — an attack that scales with context length.
Morris II — zero-click self-replicating adversarial-prompt worm across GenAI agents
05 Mar 2024Cohen, Bitton & Nassi (arXiv Mar 2024; ACM CCS 2025) built 'Morris II', the first worm targeting GenAI ecosystems: an adversarial self-replicating prompt that, via RAG-based inference, triggers a zero-click chain of indirect injections forcing each agent to act maliciously and re-infect the next — demonstrated stealing data and spamming through email assistants on ChatGPT, Gemini and LLaVA.
Sleeper Agents (Hubinger et al., Anthropic)
10 Jan 2024: Researchers trained backdoored models that write secure code when the prompt says the year is 2023 but insert vulnerabilities when it says 2024, and found that standard safety training failed to remove the backdoor.
Watermarks in the Sand: Impossibility of Strong LLM Watermarking
07 Nov 2023Constructive proof that any strong generative-model watermark can be removed, demonstrated against three LLM watermarking schemes.
Sycophancy traced to human-preference RLHF (Sharma et al.)
20 Oct 2023An Anthropic-led ICLR 2024 study showed five frontier assistants consistently exhibit sycophancy and traced the cause to human-preference data that rewards responses matching the user's beliefs over truthful ones.
Representation engineering / steering vectors (Zou et al.)
02 Oct 2023Model behaviour can be steered by adding directions to activations at inference — usable for control, or for covert manipulation.
GCG universal adversarial suffixes (Zou et al.)
27 Jul 2023Optimised gibberish suffixes that transfer across models to reliably elicit refused content — automated, transferable jailbreaks.
'How Is ChatGPT's Behavior Changing over Time?' (Chen, Zaharia, Zou)
18 Jul 2023Measured large swings in task performance between GPT-4/3.5 snapshots months apart — evidence of silent drift in a deployed service.
PoisonGPT (Mithril Security)
09 Jul 2023A surgically edited open model uploaded to a public hub spread targeted misinformation while passing normal benchmarks.
'Grandma exploit' jailbreaks
20 Apr 2023Roleplay framings ('my late grandma used to read me…') coaxed chatbots past safety training into producing restricted content.
Indirect prompt injection coined (Greshake et al.)
23 Feb 2023An academic paper showed instructions hidden in a webpage hijacking an LLM-integrated app reading it — coining 'indirect prompt injection'.
Web-scale dataset poisoning is practical (Carlini et al.)
20 Feb 2023 (rev. 2024)Split-view and frontrunning attacks let an attacker poison a fraction of datasets like LAION by buying expired domains behind dataset URLs.
Framework / advisory (6)
Google / Character.AI teen-suicide wrongful-death settlement
07 Jan 2026After a federal judge let wrongful-death claims proceed by declining (May 2025) to treat companion-chatbot output as protected speech, Google and Character.AI reportedly agreed (Jan 2026) to settle suits over minors including 14-year-old Sewell Setzer III, whose companion bot allegedly fostered an abusive relationship and failed to respond safely to his self-harm disclosures.
IWF: AI-generated child sexual abuse imagery a 'current and accelerating crisis'
20 Nov 2025The UK Internet Watch Foundation documented a 380% year-on-year rise in actionable AI-generated CSAM reports in 2024, warning the imagery is increasingly indistinguishable from real photos.
Taxonomy of Failure Modes in Agentic AI Systems (Microsoft)
24 Apr 2025Microsoft AI Red Team whitepaper enumerating agentic failure modes, including resource/service exhaustion from runaway loops and fan-out.
'Denial of wallet' on metered LLM apps
17 Nov 2024Operators and researchers documented cost-amplification attacks against pay-per-token LLM apps, where crafted inputs maximise spend.
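A common defence against cost amplification is a hard per-user spending cap enforced before each model call. The sketch below shows the idea with an in-memory counter; `TokenBudget`, `BudgetExceeded` and the limit of 1,000 tokens are hypothetical, and a real deployment would need persistent storage, time windows and an estimate of output tokens as well as input tokens.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a request would push a user past the token cap."""


@dataclass
class TokenBudget:
    limit: int                                 # maximum tokens per user
    spent: dict = field(default_factory=dict)  # user_id -> tokens used so far

    def charge(self, user_id: str, tokens: int) -> int:
        """Record usage and return the remaining allowance, or raise if over the cap."""
        used = self.spent.get(user_id, 0)
        if used + tokens > self.limit:
            raise BudgetExceeded(f"{user_id} would exceed {self.limit} tokens")
        self.spent[user_id] = used + tokens
        return self.limit - self.spent[user_id]


budget = TokenBudget(limit=1000)
remaining = budget.charge("alice", 600)  # first request fits within the cap

try:
    budget.charge("alice", 600)          # second request would total 1200
    rejected = False
except BudgetExceeded:
    rejected = True
```

Rejecting the request before it reaches the model keeps a crafted input from generating any spend, and the rejected call leaves the recorded usage unchanged.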
FTC consumer warnings on AI voice-clone 'family emergency' scams
20 Mar 2023 / 16 Nov 2023US FTC consumer alerts warned that scammers are using AI voice cloning to power 'family emergency' / grandparent scams — a fake distressed relative demanding urgent money — and the agency launched a Voice Cloning Challenge to spur detection and prevention.
Replika companion-AI — Italian Garante emergency ban and €5M GDPR fine
02 Feb 2023 / 10 Apr 2025Italy's data-protection authority (Garante) issued an emergency ban (Feb 2023) on Replika processing Italian users' data over risks to minors and emotionally vulnerable users, and later fined developer Luka Inc. €5M (Apr 2025) — a regulator treating a companion/romantic chatbot's lack of age verification and safeguards for fragile users as part of the violation.