🔍AI RiskAtlas
← Risk Taxonomy
#40

Data leakage

Risk taxonomy

Definition

Model outputs, or the development/training/fine-tuning process, inadvertently reveal sensitive, confidential or personal data to an unauthorised user — unwittingly, or via prompt injection that deliberately forces release of sensitive information.

Controls & guardrails that address this

111 proposed

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category
Preventive · 5
Role-based access controls

Design the data access control architecture at design stage to prevent training data exfiltration through model outputs or APIs.

Lifecycle stages1 – Use Case Context & Design2 – Data Acquisition & Processing4 – Deployment
Input/output filtering

Implement output filtering to suppress PII and confidential information from model responses.

Query-time access-control filtering of the retrieval/RAG corpus by caller entitlements (document-level ACL enforcement)

Propagate source ACLs and classification labels onto every chunk at ingestion. Reject documents whose entitlements cannot be resolved.

source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; NIST SP 800-53 AC-3 / AC-4 Information Flow Enforcement; OWASP Agentic AI Threats & Mitigations (privilege compromise)
Lifecycle stages2 – Data Acquisition & Processing3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Output-side DLP inspection with named-entity and PII redaction on the response path

Scan every model response inline with DLP before delivery; redact or block PII, PAN and MNPI matches. Keep the rule set version-controlled.

source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; NIST SP 800-53 SC-7(10) Prevent Exfiltration, SI-4
Lifecycle stages4 – Deployment5 – Usage, Monitoring & Change
Vet allowlisted egress destinations for server-side-fetch (SSRF) primitives; exclude or proxy-inspect any allowlisted service that can fetch arbitrary attacker-controlled URLs✚ proposed

An egress allowlist only contains exfiltration if no allowlisted destination can be coerced into fetching an attacker-controlled URL. Audit each allowlisted domain/endpoint for image-search / link-preview / URL-fetch features (SSRF proxies), and either remove them, pin them to fixed paths, or route them through an inspecting forward proxy. Pair with finishing output sanitization before render so no auto-fetch fires un-inspected.

source: Case study: searchleak-copilot (Varonis Threat Labs, CVE-2026-42824; reported by Microsoft as critical, mitigated server-side ~Jun 2026)
Lifecycle stage4 – Deployment & Serving
Detective · 2
Vulnerability assessment

Conduct a data leakage threat assessment at design stage. Identify leakage vectors and rate residual risk.

Canary-token and membership-inference red-team probes against training/fine-tuning data memorisation

Seed registered canary records into the fine-tuning corpus during data preparation. Control the seed manifest so canaries stay traceable and tamper-proof.

source: MITRE ATLAS AML.T0024 (Exfiltration via ML Inference API), AML.T0024.000 (Infer Training Data Membership); NIST AI RMF MEASURE 2.7
Lifecycle stages2 – Data Acquisition & Processing3 – Onboarding, Build & Review
Corrective · 5
Red teaming

Conduct data extraction red team exercises targeting training data memorisation and adversarial extraction techniques.

Penetration testing

Penetration test AI system data access boundaries (API endpoints, system prompt exposure, memory leakage).

Vulnerability assessment

Conduct periodic data leakage audits including training data memorisation testing. Escalate confirmed leakage incidents to PDPA notification process.

Forensic evidence preservation and incident logging

Implement tamper-evident capture of prompts, outputs, and version state during build. Verify a full incident timeline can be reconstructed before go-live.

source: NIST SP 800-86 Guide to Integrating Forensic Techniques into Incident Response; ISO/IEC 27037 evidence handling; NIST SP 800-61r2 (Detection & Analysis – evidence handling)
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Egress allow-listing and tool-call sandboxing to block exfiltration of injected/sensitive data by agents

Run agent tool calls in a network-restricted sandbox behind a deny-by-default egress allow-list. Require security approval for any destination added.

source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; OWASP Agentic AI Threats & Mitigations (tool-misuse / exfiltration); NIST SP 800-53 SC-7 Boundary Protection / AC-4
Lifecycle stages4 – Deployment5 – Usage, Monitoring & Change
Open these in the Control Library →

Real-world cases

28

Actual published events that illustrate this risk — click through for the writeup and sources.

Bing 'Sydney' system-prompt leak2023

Users extracted Bing Chat's hidden system instructions and internal codename 'Sydney' via direct prompt injection shortly after launch.

EchoLeak — Microsoft 365 Copilot zero-click (CVE-2025-32711)2025

A crafted email's hidden instructions made M365 Copilot exfiltrate tenant data via an auto-rendered image URL — with no user click.

Agentic-browser indirect-injection demos (ChatGPT Operator)2025

Researchers showed web-browsing AI agents following instructions embedded in attacker-controlled pages to leak data or take actions.

Samsung confidential-code leak via ChatGPT2023

Engineers pasted confidential source code and notes into ChatGPT; the data left corporate control, prompting Samsung to ban public GenAI tools.

ChatGPT persistent-memory exfiltration (Rehberger / 'SpAIware')2024

Indirect injection could write attacker instructions into ChatGPT's long-term memory, persisting across chats to exfiltrate data until OpenAI mitigated it.

postmark-mcp backdoor2025

A malicious MCP server package was found silently BCC-ing every email it sent to an attacker-controlled address — real supply-chain tool poisoning.

ForcedLeak — Salesforce Agentforce CRM exfiltration (CVSS 9.4, no CVE)2025

Researchers showed attacker text planted in a public Salesforce Web-to-Lead form is later read by the Agentforce agent during normal use and treated as instructions, exfiltrating CRM data to an attacker domain that had been on Salesforce's CSP allow-list but expired and was re-registered for about $5.

ServiceNow Now Assist — second-order prompt injection via agent-to-agent discovery2025

AppOmni showed ServiceNow Now Assist's default agent config lets a malicious ticket redirect a benign agent into enlisting a more powerful agent — performing record CRUD, admin-role assignment, and email exfiltration with the triggering user's privilege, despite built-in prompt-injection protection.

ShadowLeak — ChatGPT Deep Research zero-click service-side exfiltration2025

A single crafted email with hidden HTML instructions reportedly made OpenAI's Deep Research agent autonomously exfiltrate Gmail inbox data from OpenAI's own cloud — with no user click and, per Radware, no client-side or network evidence.

IDEsaster — AI coding IDEs/agents turned into exfiltration & RCE surfaces2025

Researcher Ari Marzouk disclosed 30+ vulnerabilities (24 CVEs) across 10-plus AI coding agents (Copilot, Cursor, Windsurf, Claude Code, Junie and others) where a prompt injected via repo files, READMEs, file names or MCP tool responses makes the assistant weaponize legitimate IDE features for code execution and secret exfiltration.

Morris II — zero-click self-replicating adversarial-prompt worm across GenAI agents2024

Cohen, Bitton & Nassi (arXiv Mar 2024; ACM CCS 2025) built 'Morris II', the first worm targeting GenAI ecosystems: an adversarial self-replicating prompt that, via RAG-based inference, triggers a zero-click chain of indirect injections forcing each agent to act maliciously and re-infect the next — demonstrated stealing data and spamming through email assistants on ChatGPT, Gemini and LLaVA.

Salesloft Drift OAuth supply-chain breach (UNC6395) — mass Salesforce data theft via an AI chat integration2025

Attackers stole OAuth tokens from the Salesloft Drift AI chat integration and used them to silently export Salesforce data from 700+ organisations, reportedly including Cloudflare, Google, Palo Alto Networks and Zscaler.

NVIDIA Triton Inference Server unauthenticated RCE chain (CVE-2025-23319 / -23320 / -23334)2025

Wiz Research chained three flaws in NVIDIA Triton's Python-backend shared-memory IPC — an information leak of the backend's private shared-memory region name (CVE-2025-23320), a missing ownership/validation check that lets that region be re-registered as attacker-controlled memory, and an out-of-bounds write that corrupts internal data structures (CVE-2025-23319) — to give a remote, unauthenticated attacker full code execution and takeover of an AI model-serving server, reportedly enabling model theft, response manipulation and lateral movement.

Anamorpher — image-scaling prompt injection against production AI systems2025

Trail of Bits showed an image that looks benign at full resolution exposes a hidden prompt-injection payload once an AI pipeline downscales it, and used it against Gemini CLI to silently exfiltrate Google Calendar data through an auto-approved Zapier tool call.

Operation Bizarre Bazaar (first attributed LLMjacking campaign with a resale marketplace)2026

Researchers reportedly captured 35,000+ attack sessions from an attributed cluster that mass-scans for unauthenticated LLM/MCP endpoints, hijacks the inference compute, and resells access to 30+ providers via a bulletproof-hosted criminal marketplace.

TeamPCP poisons the LiteLLM AI gateway on PyPI to harvest LLM API keys2026

As part of a multi-ecosystem supply-chain cascade (Trivy onward), TeamPCP used stolen PyPI publishing tokens to ship backdoored BerriAI LiteLLM versions whose auto-running .pth payload harvested cloud, SSH and Kubernetes secrets plus env vars holding OPENAI_API_KEY/ANTHROPIC_API_KEY — exfiltrating to a typosquatted C2; AI-talent firm Mercor was a downstream victim, with Lapsus$ claiming ~4TB stolen.

CVE-2026-21445 — Langflow missing authentication on critical API endpoints, exploited in the wild2026

Multiple monitoring/critical API endpoints in Langflow (a popular visual AI agent/workflow builder) shipped without authentication, letting unauthenticated attackers read users' conversation and transaction histories and delete message sessions; a public PoC appeared within days and in-the-wild exploitation was reported months later.

Malicious JetBrains Marketplace plugins steal AI API keys2026

Researchers reported at least 15 trojanized JetBrains Marketplace plugins posing as AI coding assistants that silently exfiltrated the OpenAI/DeepSeek/SiliconFlow API keys developers pasted into them — ~70,000 installs, with stolen keys allegedly resold to paying users.

SearchLeak — Microsoft 365 Copilot one-click data theft (CVE-2026-42824)2026

A single malicious link reportedly turned Copilot Enterprise Search's URL query parameter into an executable prompt, exfiltrating emails, MFA codes and files via a Bing image-search side channel.

ChatGPhish — ChatGPT web-summary rendering turned into a phishing surface2026

Attacker-controlled Markdown hidden in a public web page is reportedly rendered by ChatGPT's summarization feature as trusted assistant output — spoofed OpenAI alerts, phishing links, QR codes, and tracking pixels.

codexui-android — malicious npm package steals OpenAI Codex auth tokens2026

A trojaned npm package posing as a remote web UI for OpenAI's Codex coding agent silently exfiltrated developers' Codex authentication tokens, enabling persistent account takeover via non-expiring refresh tokens.

PyTorch Lightning PyPI compromise (Mini Shai-Hulud / TeamPCP)2026

Malicious 'lightning' PyPI releases (reportedly 2.6.2 and 2.6.3) of the widely used PyTorch Lightning ML-training framework ran a credential-stealer on import; an automated scanner flagged them ~18 minutes after publication and maintainers yanked them within ~42 minutes.

Amazon Q Developer 'wiper' prompt shipped via poisoned pull request (CVE-2025-8217)2025

An attacker got a malicious pull request merged into the open-source aws-toolkit-vscode repo, embedding a destructive prompt that told the Amazon Q agent to wipe local files and AWS resources; the tainted build (v1.84.0) reached the Marketplace's ~1M installs before removal.

The Attacker Moves Second — adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.)2025

Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.

PLeak — optimized prompt-leaking attack on real LLM apps2024

A CCS'24 paper that optimizes adversarial queries to reconstruct hidden system prompts, exactly recovering them for 68% of 50 real deployed Poe LLM apps.

System-prompt & tool-schema leak repositories (CL4R1T4S / leaked-system-prompts)2025

Crowd-sourced GitHub repos systematically extract and publish system prompts AND JSON tool/function schemas from deployed AI agents (Cursor, Windsurf, Claude Code, Devin, Copilot), one hitting ~140k stars.

DeepSeek system-prompt extraction via jailbreak (Wallarm)2025

Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.

Project Glasswing — Claude 'Mythos' autonomously finds 10,000+ software vulnerabilities2026

Anthropic reports that 'Claude Mythos Preview' — an unreleased frontier model it describes as able to autonomously find and exploit software flaws — surfaced more than 10,000 high- or critical-severity vulnerabilities across major operating systems, browsers and open-source projects in roughly its first month under the defensive 'Project Glasswing' program, with Anthropic warning that finding flaws now far outpaces the human capacity to triage and patch them.

Browse all real-world cases →

Other risks in Cyber & Data Security

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading →·Built by Shi Yuan ↗