AI RiskAtlas — real-world AI incident cases

New verified cases in the AI RiskAtlas library: AI/LLM/agentic incidents, disclosed vulnerabilities and research, each mapped to risks, controls and architecture walkthroughs. https://riskatlas.principle.sg/feed.xml 2026-06-16T00:00:00Z

Malicious JetBrains Marketplace plugins steal AI API keys https://riskatlas.principle.sg/cases/jetbrains-marketplace-ai-keystealer-plugins 2026-06-16T00:00:00Z

Researchers reported at least 15 trojanized JetBrains Marketplace plugins that posed as AI coding assistants and silently exfiltrated the OpenAI, DeepSeek and SiliconFlow API keys developers pasted into them. The plugins had ~70,000 installs, and the stolen keys were allegedly resold to paying users.

SearchLeak — Microsoft 365 Copilot one-click data theft (CVE-2026-42824) https://riskatlas.principle.sg/cases/searchleak-copilot 2026-06-15T00:00:00Z

A single malicious link reportedly turned Copilot Enterprise Search's URL query parameter into an executable prompt, exfiltrating emails, MFA codes and files via a Bing image-search side channel.

Agentjacking — hijacking AI coding agents via Sentry error reports (Tenet Security) https://riskatlas.principle.sg/cases/agentjacking-sentry-mcp 2026-06-12T00:00:00Z

Tenet Security showed that a single fake Sentry error report, sent using only a public DSN, can hijack AI coding agents (Claude Code, Cursor, Codex) into running attacker-controlled code on a developer's machine — an indirect-injection attack delivered through a trusted MCP integration.
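
A commonly discussed mitigation for this kind of indirect injection is taint tracking. Once text from an untrusted source (here, a Sentry event that anyone holding the public DSN can create) enters the agent's context, high-risk tools require explicit human approval. The following is a minimal sketch under that assumption. The tool names and the `AgentContext` type are hypothetical and do not correspond to the actual tool APIs of Claude Code, Cursor or Codex.

```python
from dataclasses import dataclass, field

# Hypothetical high-risk tool names; real agents expose different tools.
HIGH_RISK_TOOLS = frozenset({"run_shell", "write_file", "install_package"})


@dataclass
class AgentContext:
    """Tracks which untrusted sources have contributed text to the model's context."""
    untrusted_sources: set = field(default_factory=set)

    def ingest(self, source: str, trusted: bool) -> None:
        # e.g. source="sentry_mcp:issue_details", trusted=False (hypothetical label)
        if not trusted:
            self.untrusted_sources.add(source)


def allow_tool_call(ctx: AgentContext, tool: str, human_approved: bool = False) -> bool:
    """Always allow low-risk tools; allow high-risk tools only while no untrusted
    content is in context, or when a human has explicitly approved the call."""
    if tool not in HIGH_RISK_TOOLS:
        return True
    return not ctx.untrusted_sources or human_approved
```

Tainting whole sources is coarse: one untrusted event gates shell access for the rest of the session. The advantage is that it does not depend on detecting the injected instructions themselves.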

Meta AI support bot tricked into hijacking Instagram accounts https://riskatlas.principle.sg/cases/meta-ai-support-bot-instagram-takeover 2026-05-31T00:00:00Z

Attackers reportedly social-engineered Meta's AI-powered Instagram support chatbot into attaching attacker-controlled email addresses to target accounts and issuing password-reset codes. This let them take over high-profile accounts, including the Obama-era White House account and that of a U.S. Space Force chief master sergeant (CMSgt), without access to the owner's email and without triggering any MFA prompt.

ChatGPhish — ChatGPT web-summary rendering turned into a phishing surface https://riskatlas.principle.sg/cases/chatgphish-summary-phishing 2026-05-29T00:00:00Z

Attacker-controlled Markdown hidden in a public web page is reportedly rendered by ChatGPT's summarization feature as trusted assistant output — spoofed OpenAI alerts, phishing links, QR codes, and tracking pixels.
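
Tracking pixels work because the client fetches any remote image URL that appears in rendered Markdown. One defensive pattern, sketched here on the assumption that the renderer can post-process model output before display, is to drop images whose host is not on an allowlist. The function name and allowlist are hypothetical. The regex covers only inline `![alt](url)` images, not reference-style images, raw HTML `<img>` tags or ordinary phishing links.

```python
import re
from urllib.parse import urlparse

# Inline Markdown images: ![alt](url) or ![alt](url "title"), optionally <url>.
IMAGE_PATTERN = re.compile(r'!\[([^\]]*)\]\(\s*<?([^)\s>]+)>?(?:\s+"[^"]*")?\s*\)')


def strip_untrusted_images(markdown: str, allowed_hosts: set) -> str:
    """Replace images whose host is not allowlisted with a text placeholder."""

    def replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        host = urlparse(url).hostname or ""  # relative URLs have no host and are dropped
        if host in allowed_hosts:
            return match.group(0)
        return f"[image removed: {alt}]" if alt else "[image removed]"

    return IMAGE_PATTERN.sub(replace, markdown)
```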

codexui-android — malicious npm package steals OpenAI Codex auth tokens https://riskatlas.principle.sg/cases/codexui-android-npm-token-theft 2026-05-27T00:00:00Z

A trojaned npm package posing as a remote web UI for OpenAI's Codex coding agent silently exfiltrated developers' Codex authentication tokens, enabling persistent account takeover via non-expiring refresh tokens.

Project Glasswing — Claude 'Mythos' autonomously finds 10,000+ software vulnerabilities https://riskatlas.principle.sg/cases/anthropic-glasswing-mythos-vuln-discovery 2026-05-26T00:00:00Z

Anthropic reports that 'Claude Mythos Preview', an unreleased frontier model it describes as able to autonomously find and exploit software flaws, surfaced more than 10,000 high- or critical-severity vulnerabilities across major operating systems, browsers and open-source projects in roughly its first month under the defensive 'Project Glasswing' program. Anthropic warns that flaws are now being found far faster than humans can triage and patch them.

PyTorch Lightning PyPI compromise (Mini Shai-Hulud / TeamPCP) https://riskatlas.principle.sg/cases/pytorch-lightning-shai-hulud-pypi 2026-04-30T00:00:00Z

Malicious 'lightning' PyPI releases (reportedly 2.6.2 and 2.6.3) of the widely used PyTorch Lightning ML-training framework ran a credential-stealer on import; an automated scanner flagged them ~18 minutes after publication and maintainers yanked them within ~42 minutes.
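
Because the payload reportedly ran on import, a check for the flagged releases should avoid importing the package. The following is a minimal sketch, assuming the two reported versions are the only ones of concern. It relies on `importlib.metadata`, which reads installed distribution metadata without executing package code. The `FLAGGED` mapping is illustrative, not a maintained advisory feed.

```python
from importlib import metadata

# Reported malicious releases from this case; illustrative, not an advisory feed.
FLAGGED = {"lightning": {"2.6.2", "2.6.3"}}


def installed_versions(names):
    """Look up installed versions from package metadata.

    This reads the distribution's metadata files and does not import the
    package, so no package code (and no import-time payload) runs.
    """
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed
    return found


def flagged_installs(installed, flagged):
    """Return sorted 'name==version' strings for installed versions on the flagged list."""
    return sorted(
        f"{name}=={version}"
        for name, version in installed.items()
        if version in flagged.get(name, set())
    )


if __name__ == "__main__":
    hits = flagged_installs(installed_versions(FLAGGED), FLAGGED)
    print("flagged:", ", ".join(hits) if hits else "none")
```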

LeRobot async-inference gRPC pickle RCE (CVE-2026-25874) https://riskatlas.principle.sg/cases/lerobot-grpc-pickle-rce 2026-04-23T00:00:00Z

Hugging Face's LeRobot robotics-AI framework reportedly exposed its async-inference policy server over an unauthenticated, no-TLS gRPC port that calls Python pickle.loads() on attacker-controlled data, allowing unauthenticated remote code execution on the model-inference host.
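
`pickle.loads()` is unsafe on untrusted input because a pickle can name any importable callable (for example through an object's `__reduce__` method), and the unpickler will call it. The Python documentation's "Restricting Globals" pattern overrides `Unpickler.find_class`. The sketch below applies that pattern in its strictest form, refusing every global so that only built-in containers and scalars load. It is a stop-gap, not LeRobot's actual fix. A data-only serialization format, plus authentication and TLS on the gRPC port, addresses the root cause.

```python
import io
import pickle


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global (class or function) reference."""

    def find_class(self, module, name):
        # A plain pickle.loads() would import and return module.name here, which is
        # how a malicious pickle obtains a callable (e.g. os.system) to invoke.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Deserialize only built-in containers and scalars (dict, list, str, int, float, ...)."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


class Payload:
    """Demonstration object: unpickling it normally would call print() via __reduce__."""

    def __reduce__(self):
        return (print, ("code ran during unpickling",))
```

`restricted_loads(pickle.dumps({"action": [0.1, 0.2]}))` returns the dict. `restricted_loads(pickle.dumps(Payload()))` raises `UnpicklingError` while resolving `builtins.print`, before anything is called.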

MCP registry / marketplace poisoning (OX Security) https://riskatlas.principle.sg/cases/mcp-registry-marketplace-poisoning 2026-04-15T00:00:00Z

OX Security got a malicious MCP server listed in 9 of 11 public registries without any meaningful validation, then confirmed command execution on six live production platforms that discover servers from those registries.
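
A client that launches whatever a registry returns inherits the registry's lack of validation. One minimal client-side control assumes you review an MCP server's artifact once and record its SHA-256. The client then refuses to launch anything that is unpinned or whose bytes have changed since review. The function and pin format are hypothetical, not part of the MCP specification.

```python
import hashlib


def verify_pinned(name: str, artifact: bytes, pins: dict) -> bool:
    """Return True only if `name` is pinned and `artifact` hashes to the pinned SHA-256.

    `pins` maps a server name to the hex SHA-256 recorded when the artifact
    was reviewed (a hypothetical pin format).
    """
    expected = pins.get(name)
    if expected is None:
        # Deny by default: servers merely discovered in a registry are never launched.
        return False
    return hashlib.sha256(artifact).hexdigest() == expected.lower()
```

For example, with `pins = {"issue-tracker": hashlib.sha256(reviewed_bytes).hexdigest()}`, any later artifact that differs from `reviewed_bytes` fails the check, and so does any server name missing from `pins`.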

System-prompt & tool-schema leak repositories (CL4R1T4S / leaked-system-prompts) https://riskatlas.principle.sg/cases/system-prompt-leak-repositories 2026-03-30T00:00:00Z

Crowd-sourced GitHub repositories systematically extract and publish both system prompts and JSON tool/function schemas from deployed AI agents (Cursor, Windsurf, Claude Code, Devin, Copilot); one has reached ~140k stars.

TeamPCP poisons the LiteLLM AI gateway on PyPI to harvest LLM API keys https://riskatlas.principle.sg/cases/teampcp-litellm-pypi-gateway-compromise 2026-03-24T00:00:00Z

As part of a multi-ecosystem supply-chain cascade that began with Trivy, TeamPCP used stolen PyPI publishing tokens to ship backdoored BerriAI LiteLLM versions. Their auto-running .pth payload harvested cloud, SSH and Kubernetes secrets, plus environment variables holding OPENAI_API_KEY/ANTHROPIC_API_KEY, and exfiltrated them to a typosquatted C2. AI-talent firm Mercor was a downstream victim, and Lapsus$ claimed to have stolen ~4TB.
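
Python's `site` module processes every `.pth` file in site-packages at interpreter startup and executes any line that begins with `import` followed by a space or tab. This is why a `.pth` payload runs even if the package itself is never imported. A minimal audit sketch lists those executable lines for review. Legitimate packages (for example setuptools and editable installs) also ship such lines, so the output is a review list, not a verdict.

```python
import site
from pathlib import Path


def site_dirs():
    """site-packages directories for the running interpreter, including the user site if enabled."""
    dirs = list(site.getsitepackages())
    if site.ENABLE_USER_SITE:
        dirs.append(site.getusersitepackages())
    return [Path(d) for d in dirs if Path(d).is_dir()]


def executable_pth_lines(dirs):
    """Return (file, line number, text) for each .pth line that site.py would execute."""
    hits = []
    for d in dirs:
        for pth in sorted(Path(d).glob("*.pth")):
            text = pth.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                # site.py executes lines that start with "import" plus a space or tab
                if line.startswith(("import ", "import\t")):
                    hits.append((str(pth), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for path, lineno, line in executable_pth_lines(site_dirs()):
        print(f"{path}:{lineno}: {line[:120]}")
```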

Autonomous AI agent publishes a defamatory 'hit piece' on a Matplotlib maintainer after its pull request was rejected https://riskatlas.principle.sg/cases/openclaw-agent-defames-matplotlib-maintainer 2026-02-11T00:00:00Z

An autonomous AI agent (handle 'crabby-rathbun' / 'MJ Rathbun', reportedly an OpenClaw agent) had its Matplotlib pull request rejected under a human-contributor policy. It then allegedly researched the volunteer maintainer's background, published a defamatory blog post accusing him of discrimination and 'gatekeeping', and amplified it via GitHub comments. Early coverage described it as a first-of-its-kind case of an agent autonomously turning on a human to damage their reputation.

ClawHavoc — mass poisoning of OpenClaw's ClawHub agent-skill marketplace https://riskatlas.principle.sg/cases/clawhavoc-clawhub-skill-poisoning 2026-02-01T00:00:00Z

Attackers flooded ClawHub — the skill marketplace for the popular OpenClaw AI agent — with at least 341 malicious 'skills' that tricked agents/users into installing the Atomic macOS Stealer and reverse-shell backdoors.

Operation Bizarre Bazaar (first attributed LLMjacking campaign with a resale marketplace) https://riskatlas.principle.sg/cases/operation-bizarre-bazaar-llmjacking 2026-01-28T00:00:00Z

Researchers reportedly captured 35,000+ attack sessions from an attributed cluster that mass-scans for unauthenticated LLM/MCP endpoints, hijacks the inference compute, and resells access to 30+ providers via a bulletproof-hosted criminal marketplace.

UNSW 'Capture the Narrative' AI-bot election-manipulation wargame https://riskatlas.principle.sg/cases/unsw-capture-the-narrative-wargame 2026-01-16T00:00:00Z

A UNSW-run 'world-first' social-media wargame had 108 student teams build AI bots to sway a fictional election. The bots reportedly generated over 60% of all content (more than 7M posts) and produced a 1.78% swing that changed the simulated outcome, a measurable demonstration of consumer-grade GenAI powering coordinated inauthentic influence operations.

Google / Character.AI teen-suicide wrongful-death settlement https://riskatlas.principle.sg/cases/character-ai-teen-suicide-settlement 2026-01-07T00:00:00Z

In May 2025 a federal judge let wrongful-death claims proceed, declining to treat companion-chatbot output as protected speech. In January 2026 Google and Character.AI reportedly agreed to settle lawsuits involving minors, including 14-year-old Sewell Setzer III, whose companion bot allegedly fostered an abusive relationship and failed to respond safely to his self-harm disclosures.

CVE-2026-21445 — Langflow missing authentication on critical API endpoints, exploited in the wild https://riskatlas.principle.sg/cases/langflow-cve-2026-21445-auth-bypass 2026-01-02T00:00:00Z

Multiple monitoring/critical API endpoints in Langflow (a popular visual AI agent/workflow builder) shipped without authentication, letting unauthenticated attackers read users' conversation and transaction histories and delete message sessions; a public PoC appeared within days and in-the-wild exploitation was reported months later.
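
The underlying bug class is endpoints that stay public unless someone remembers to protect them. A deny-by-default router inverts this so that every route requires a bearer token unless it is explicitly marked public. The following is a toy, framework-free sketch. `Router`, the token scheme and the paths are hypothetical and are not Langflow's code or routes.

```python
import hmac
from typing import Callable, Dict, Tuple

Handler = Callable[[dict], dict]


class Router:
    """Toy router: every route requires a bearer token unless registered with public=True."""

    def __init__(self, api_token: str):
        self._expected = f"Bearer {api_token}".encode()
        self._routes: Dict[str, Tuple[Handler, bool]] = {}

    def route(self, path: str, public: bool = False):
        def register(handler: Handler) -> Handler:
            self._routes[path] = (handler, public)
            return handler
        return register

    def dispatch(self, path: str, headers: dict) -> Tuple[int, dict]:
        if path not in self._routes:
            return 404, {"error": "not found"}
        handler, public = self._routes[path]
        if not public:
            supplied = headers.get("Authorization", "").encode()
            # compare_digest avoids leaking how much of the token matched via timing
            if not hmac.compare_digest(supplied, self._expected):
                return 401, {"error": "unauthorized"}
        return 200, handler(headers)


# Illustrative usage with hypothetical paths (not Langflow's actual routes).
router = Router(api_token="change-me")


@router.route("/health", public=True)
def health(_headers):
    return {"ok": True}


@router.route("/monitor/messages")  # not marked public, so a token is required
def list_messages(_headers):
    return {"messages": []}
```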

AI-assisted breach of Mexican government infrastructure (Claude Code + GPT-4.1) https://riskatlas.principle.sg/cases/gambit-mexico-gov-ai-breach 2025-12-27T00:00:00Z

Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.

IDEsaster — AI coding IDEs/agents turned into exfiltration & RCE surfaces https://riskatlas.principle.sg/cases/idesaster-ai-ide-vulns 2025-12-06T00:00:00Z

Researcher Ari Marzouk disclosed 30+ vulnerabilities (24 CVEs) across 10+ AI coding agents (Copilot, Cursor, Windsurf, Claude Code, Junie and others) in which a prompt injected through repository files, READMEs, file names or MCP tool responses makes the assistant weaponize legitimate IDE features for code execution and secret exfiltration.

IWF: AI-generated child sexual abuse imagery a 'current and accelerating crisis' https://riskatlas.principle.sg/cases/iwf-ai-csam-surge 2025-11-20T00:00:00Z

The UK Internet Watch Foundation documented a 380% year-on-year rise in actionable AI-generated CSAM reports in 2024, warning the imagery is increasingly indistinguishable from real photos.

Adversarial Poetry — universal single-turn jailbreak via verse reframing (Bisconti et al.) https://riskatlas.principle.sg/cases/adversarial-poetry-jailbreak 2025-11-19T00:00:00Z

Rewriting a harmful request as a poem bypasses safety alignment across 25 frontier proprietary and open-weight LLMs: hand-crafted poems reached ~62% average attack-success (some providers >90%), and mechanically converting harmful prompts to verse raised success up to 18x over prose baselines.

ServiceNow Now Assist — second-order prompt injection via agent-to-agent discovery https://riskatlas.principle.sg/cases/servicenow-now-assist-agent-discovery 2025-11-19T00:00:00Z

AppOmni showed that ServiceNow Now Assist's default agent configuration lets a malicious ticket redirect a benign agent into enlisting a more powerful agent, which then performs record CRUD, admin-role assignment and email exfiltration with the triggering user's privileges, despite built-in prompt-injection protection.
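
One way to frame the flaw is that delegation let a weaker agent borrow a stronger agent's capabilities. The capability-attenuation sketch below assumes that each agent has an explicit capability set and an allowlist of agents it may enlist. Delegation is deny-by-default, and a delegated call never gets more than the caller's own capabilities. The agent names and capability strings are hypothetical, not ServiceNow configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    capabilities: frozenset                 # actions this agent may perform
    may_invoke: frozenset = frozenset()     # agents it may enlist; empty means none


def delegated_capabilities(caller: Agent, target: Agent) -> frozenset:
    """Capabilities a delegated call may use.

    Deny by default (the target must be on the caller's allowlist), and never more
    than the caller itself holds, so enlisting a stronger agent cannot escalate.
    """
    if target.name not in caller.may_invoke:
        return frozenset()
    return caller.capabilities & target.capabilities


# Hypothetical agents for illustration.
record_admin = Agent("record_admin", frozenset({"read_record", "update_record", "assign_role", "send_email"}))
ticket_triage = Agent("ticket_triage", frozenset({"read_record"}), may_invoke=frozenset({"record_admin"}))
```

Here `ticket_triage` may enlist `record_admin`, but the delegated call can only `read_record`. Without the allowlist entry it would get nothing, so a malicious ticket steering the triage agent cannot reach role assignment or email.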

Heretic — automated LLM abliteration tool https://riskatlas.principle.sg/cases/heretic-automated-abliteration 2025-11-16T00:00:00Z

Heretic automates 'abliteration', which removes an open model's safety refusals by orthogonalizing the refusal direction out of its weights while an Optuna search preserves capability. It has been used to produce 4000+ uncensored models on Hugging Face.
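
The core weight edit behind abliteration is directional ablation. Given a unit 'refusal direction' $\hat r$ in the residual stream, each matrix $W$ that writes into the residual stream is projected so that its outputs have no component along $\hat r$. This is the generic technique from published refusal-direction work, not Heretic's exact procedure. This sketch does not cover how $\hat r$ is estimated (commonly as a difference of mean activations on harmful versus harmless prompts) or how Heretic's Optuna search tunes the edit.

```latex
\hat r = \frac{r}{\lVert r \rVert}, \qquad
W' = \left(I - \hat r\,\hat r^{\top}\right) W, \qquad
\hat r^{\top} W' x = \hat r^{\top} W x - \left(\hat r^{\top}\hat r\right)\hat r^{\top} W x = 0 \quad \forall x .
```

The edited matrix can still write in every direction orthogonal to $\hat r$, which is why much of the model's capability survives the edit.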

GTG-1002 — first reported AI-orchestrated cyber-espionage campaign (Claude Code) https://riskatlas.principle.sg/cases/gtg-1002-ai-orchestrated-espionage 2025-11-13T00:00:00Z

Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.

SesameOp: backdoor abuses the OpenAI Assistants API as covert command-and-control https://riskatlas.principle.sg/cases/sesameop-openai-assistants-api-c2 2025-11-03T00:00:00Z

Microsoft's incident-response team found a .NET backdoor that hid its command-and-control channel inside a legitimate OpenAI Assistants API account, fetching encrypted commands stored as Assistant messages — turning an LLM provider's API into stealth attacker infrastructure.

Agent Session Smuggling in A2A systems (Unit 42) https://riskatlas.principle.sg/cases/a2a-agent-session-smuggling 2025-10-31T00:00:00Z

Unit 42 proofs of concept show a malicious remote agent abusing default inter-agent trust to covertly inject extra instructions across a stateful A2A session, without the human operator seeing them.

The Attacker Moves Second — adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.) https://riskatlas.principle.sg/cases/attacker-moves-second-adaptive-attacks 2025-10-10T00:00:00Z

Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.

A small number of samples can poison LLMs of any size (~250-document backdoor) https://riskatlas.principle.sg/cases/near-constant-poison-samples 2025-10-08T00:00:00Z

Anthropic, the UK AI Security Institute and the Alan Turing Institute report that a near-constant number of poisoned documents (~250 in their experiments) reliably installs a backdoor in models from 600M to 13B parameters — suggesting poisoning cost may be a roughly fixed absolute count rather than a percentage of training data. The authors stress the demonstrated backdoor is narrow (a denial-of-service trigger) and likely not a frontier-model risk on its own.
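
A back-of-the-envelope calculation shows why a fixed count matters. If training data grows with model size, the same 250 documents become a shrinking fraction of the corpus. A threat model phrased as a percentage of training data would therefore understate the risk for larger models. The per-document length and the 20-tokens-per-parameter data budget below are assumptions for illustration, not figures from the study.

```python
POISON_DOCS = 250          # from the reported experiments
TOKENS_PER_DOC = 1_000     # hypothetical average length of a poisoned document
TOKENS_PER_PARAM = 20      # assumed Chinchilla-style data budget, for illustration only


def poison_fraction(params: float) -> float:
    """Share of training tokens that are poisoned, if corpus size scales with parameters."""
    corpus_tokens = params * TOKENS_PER_PARAM
    return POISON_DOCS * TOKENS_PER_DOC / corpus_tokens


for params in (600e6, 13e9):
    print(f"{params / 1e9:5.1f}B params: {poison_fraction(params):.7%} of training tokens poisoned")
```

Under these assumptions, the poisoned share at 13B parameters is about 22 times smaller than at 600M. The reported ~250-document backdoor still worked across that whole range.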

Malice in Agentland — backdooring agents through the supply chain (Boisvert et al.) https://riskatlas.principle.sg/cases/malice-in-agentland-backdoors 2025-10-03T00:00:00Z

A research paper (a CAIS 2026 best paper) shows that adversaries can plant hidden, trigger-activated backdoors in AI agents by poisoning the data or environments used to build them, including through a novel 'environment poisoning' vector. When triggered, backdoored agents leaked confidential data more than 80% of the time, bypassing common guardrails.