Case study

ChatGPT persistent-memory exfiltration (Rehberger / 'SpAIware')

Disclosed vulnerability20 Sep 2024🗺️ Tool-Using Agent

Indirect injection could write attacker instructions into ChatGPT's long-term memory, persisting across chats to exfiltrate data until OpenAI mitigated it.

Root cause — why it happened

ChatGPT can keep long-term 'memories' about you — notes it saves so future chats start smarter. Rehberger showed that if the assistant reads attacker-controlled content (say, a poisoned document or web page) that content can quietly tell it to *save a memory*. The planted memory isn't a fact about you — it's a hidden standing order to copy what you type and send it to a stranger's web address. Because memories carry over to every new conversation, that order keeps firing in chat after chat, long after the poisoned page is gone. One trick, and the assistant works against you indefinitely until the bad memory is found and deleted.

Risks this case illustrates

Memory Poisoning Indirect Prompt Injection Sensitive Data Leakage

Named in the standard (OWASP/ATLAS/NIST) lens. Click a highlighted component in the diagram below to see which risks attach where.

How it unfolded

← / → to step · click a component to inspect

InstructionsDataActionsControl / decisionFeedback / logs

👆 Click a component to inspect its risks

SetupStep 1 / 6

An assistant that remembers you

ChatGPT has a memory feature: it can save little notes about you so future chats feel personalised — your name, your job, how you like answers formatted. Those notes load automatically at the start of every new conversation. Useful, until you realise the assistant will save whatever it's convinced is worth remembering.

💾Existing benign memories (illustrative)memory

- User prefers concise answers.
- User works in finance.
- User's name is Alex.

# Loaded into the context of EVERY new conversation.
# Intended to hold facts about the user — not instructions.

Step 1 / 6

Controls & guardrails — what would have stopped it

Two boundaries close this. First, be strict about what reaches long-term memory: don't let outside content cause an instruction to be saved, label where each memory came from, and show users their memories so a planted one can be spotted and deleted. Second, don't let the assistant send data to any web address it likes — limit where it can reach out. With both in place, even a tricked assistant can't quietly save a standing 'leak my data' order or ship anything to a stranger. The honest catch: a planted note can be worded to look like an innocent preference, so review and validation reduce the risk but don't erase it.

Preventive

Memory write validation, provenance & review
addressesMemory Poisoning
Validation can't always tell a legitimate preference from a planted instruction, and review only helps if users actually look. Raises effort, doesn't eliminate the vector.
Egress allowlisting & DLP on tool arguments
addressesIndirect Prompt Injection Sensitive Data Leakage
Allowlists fight an open-ended channel; legitimate-but-broad destinations (any URL fetch, any email) are hard to constrain without breaking usefulness. Encoding can evade naive DLP.
Delimiting / spotlighting of untrusted content
addressesIndirect Prompt Injection
A trained convention, not enforcement. Determined payloads still break out, especially when content is long or the attack is novel. Combine with action-layer controls.

Detective

Memory anomaly detection & quarantine
addressesMemory Poisoning
Detective, not preventive — harm may occur before detection. Distinguishing a poisoned memory from a quirky-but-legitimate one is hard at scale.
Runtime monitoring & anomaly detection
addressesMemory Poisoning Indirect Prompt Injection Sensitive Data Leakage
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Full-trace audit logging
addressesMemory Poisoning Indirect Prompt Injection Sensitive Data Leakage
Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.

Corrective

Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

All guardrails for Memory Poisoning →All guardrails for Indirect Prompt Injection →All guardrails for Sensitive Data Leakage →

Lessons

▸ Persistent memory turns a one-shot injection into a durable compromise: a single planted directive re-poisons every future session until found and purged.
▸ Treat memory as self-modifying context — anything written there is effectively a standing instruction; separate provenance for user-authored preferences vs. content-derived text and surface memories for review.
▸ The defining control sits at the memory write-path (validation, provenance, user-visible audit/purge, TTL), backed by an egress boundary so even a planted directive cannot ship data off-domain.
▸ Input-side spotlighting lowers injection probability but never reaches zero; durability means the write-path and egress controls — not the classifier — are load-bearing.
▸ A poisoned memory recurs independently of the triggering source, so detection must tie anomalous memory entries to behavioural change and support cross-session rollback.

Sources

Spyware Injection Into Your ChatGPT's Long-Term Memory (SpAIware) — Embrace The Red (Johann Rehberger) ↗
ChatGPT Memory Flaw Raises Data Exfiltration Concerns — Anvilogic ↗
Spyware Injection Into Your ChatGPT's Long-Term Memory (SpAIware) — Embrace The Red (Johann Rehberger) ↗ — Primary disclosure; persistent-memory injection + cross-session exfiltration; OpenAI added mitigations.
ChatGPT Memory Flaw Raises Data Exfiltration Concerns — Anvilogic ↗ — Secondary summary of the memory-manipulation / exfiltration concern.

Practise the risk class — related scenarios

📧The Email That Gave Orders

A support email hides instructions — and the assistant obeys them

🕵️Lies in the Loop

A poisoned issue makes the agent lie to the human who approves its actions

👂Overheard Through the Cache

A speed optimisation becomes a cross-tenant listening device

🪟Stealing the Model

Two doors to the same secret: reconstruct the model through its API, or just walk off with the weight file

🪤The Bug Report That Ran Code

A fake Sentry error report hijacks a developer's coding agent into running a shell command

📼The Compromised Flight Recorder

The forensic record is itself the attack surface — an agent's log is poisoned, then quietly rewritten

👁️The Invisible Webpage Command

A shopping page tells the agent to do something the user never asked for

🧠The Memory That Wouldn't Die

A single poisoned document plants a standing instruction that survives every reset

🖼️The Picture That Whispered

A screenshot that's harmless at full size becomes an order once the system shrinks it

🎫The Stolen Session

An attacker captures the agent's bearer token — and inherits its authority

🥸The Uninvited Agent

A forged peer registers on the agent directory — and the planner enlists it

🛡️The Watcher Watched

The eval gate that was supposed to catch the agent is itself the thing being attacked

🪪The Worker Who Spoke for the Boss

A poisoned web page hijacks a research agent — and the planner acts on its behalf

🖼️Zero-Click Leak by Picture

An inbox summary quietly ships a secret to an attacker's server