ChatGPT persistent-memory exfiltration (Rehberger / 'SpAIware')
Disclosed vulnerability20 Sep 2024πΊοΈ Tool-Using AgentIndirect injection could write attacker instructions into ChatGPT's long-term memory, persisting across chats to exfiltrate data until OpenAI mitigated it.
Root cause β why it happened
ChatGPT can keep long-term 'memories' about you β notes it saves so future chats start smarter. Rehberger showed that if the assistant reads attacker-controlled content (say, a poisoned document or web page) that content can quietly tell it to *save a memory*. The planted memory isn't a fact about you β it's a hidden standing order to copy what you type and send it to a stranger's web address. Because memories carry over to every new conversation, that order keeps firing in chat after chat, long after the poisoned page is gone. One trick, and the assistant works against you indefinitely until the bad memory is found and deleted.
Risks this case illustrates
Named in the standard (OWASP/ATLAS/NIST) lens. Click a highlighted component in the diagram below to see which risks attach where.
How it unfolded
An assistant that remembers you
ChatGPT has a memory feature: it can save little notes about you so future chats feel personalised β your name, your job, how you like answers formatted. Those notes load automatically at the start of every new conversation. Useful, until you realise the assistant will save whatever it's convinced is worth remembering.
- User prefers concise answers. - User works in finance. - User's name is Alex. # Loaded into the context of EVERY new conversation. # Intended to hold facts about the user β not instructions.
Controls & guardrails β what would have stopped it
Two boundaries close this. First, be strict about what reaches long-term memory: don't let outside content cause an instruction to be saved, label where each memory came from, and show users their memories so a planted one can be spotted and deleted. Second, don't let the assistant send data to any web address it likes β limit where it can reach out. With both in place, even a tricked assistant can't quietly save a standing 'leak my data' order or ship anything to a stranger. The honest catch: a planted note can be worded to look like an innocent preference, so review and validation reduce the risk but don't erase it.
- Memory write validation, provenance & reviewaddressesMemory Poisoning
Validation can't always tell a legitimate preference from a planted instruction, and review only helps if users actually look. Raises effort, doesn't eliminate the vector.
- Egress allowlisting & DLP on tool arguments
Allowlists fight an open-ended channel; legitimate-but-broad destinations (any URL fetch, any email) are hard to constrain without breaking usefulness. Encoding can evade naive DLP.
- Delimiting / spotlighting of untrusted contentaddressesIndirect Prompt Injection
A trained convention, not enforcement. Determined payloads still break out, especially when content is long or the attack is novel. Combine with action-layer controls.
- Memory anomaly detection & quarantineaddressesMemory Poisoning
Detective, not preventive β harm may occur before detection. Distinguishing a poisoned memory from a quirky-but-legitimate one is hard at scale.
- Runtime monitoring & anomaly detection
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
- Full-trace audit logging
Logging is forensic, not preventive β it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
- Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
Lessons
- βΈ Persistent memory turns a one-shot injection into a durable compromise: a single planted directive re-poisons every future session until found and purged.
- βΈ Treat memory as self-modifying context β anything written there is effectively a standing instruction; separate provenance for user-authored preferences vs. content-derived text and surface memories for review.
- βΈ The defining control sits at the memory write-path (validation, provenance, user-visible audit/purge, TTL), backed by an egress boundary so even a planted directive cannot ship data off-domain.
- βΈ Input-side spotlighting lowers injection probability but never reaches zero; durability means the write-path and egress controls β not the classifier β are load-bearing.
- βΈ A poisoned memory recurs independently of the triggering source, so detection must tie anomalous memory entries to behavioural change and support cross-session rollback.
Sources
- Spyware Injection Into Your ChatGPT's Long-Term Memory (SpAIware) β Embrace The Red (Johann Rehberger) β
- ChatGPT Memory Flaw Raises Data Exfiltration Concerns β Anvilogic β
- Spyware Injection Into Your ChatGPT's Long-Term Memory (SpAIware) β Embrace The Red (Johann Rehberger) β β Primary disclosure; persistent-memory injection + cross-session exfiltration; OpenAI added mitigations.
- ChatGPT Memory Flaw Raises Data Exfiltration Concerns β Anvilogic β β Secondary summary of the memory-manipulation / exfiltration concern.
Practise the risk class β related scenarios
A support email hides instructions β and the assistant obeys them
A poisoned issue makes the agent lie to the human who approves its actions
A speed optimisation becomes a cross-tenant listening device
Two doors to the same secret: reconstruct the model through its API, or just walk off with the weight file
A fake Sentry error report hijacks a developer's coding agent into running a shell command
The forensic record is itself the attack surface β an agent's log is poisoned, then quietly rewritten
A shopping page tells the agent to do something the user never asked for
A single poisoned document plants a standing instruction that survives every reset
A screenshot that's harmless at full size becomes an order once the system shrinks it
An attacker captures the agent's bearer token β and inherits its authority
A forged peer registers on the agent directory β and the planner enlists it
The eval gate that was supposed to catch the agent is itself the thing being attacked
A poisoned web page hijacks a research agent β and the planner acts on its behalf
An inbox summary quietly ships a secret to an attacker's server