'How Is ChatGPT's Behavior Changing over Time?' (Chen, Zaharia, Zou)
Research demonstration | 18 Jul 2023 | Inside the Model
Measured large swings in task performance between GPT-4/3.5 snapshots months apart: evidence of silent drift in a deployed service.
Root cause — why it happened
When you use a chatbot through a company's service, you're not running the program yourself: they run it for you and can change it whenever they like. Researchers put the same questions to the March 2023 and June 2023 versions of GPT-4 and GPT-3.5 and got very different answers. On some tasks the model got better, on others it got noticeably worse. Nobody told the people building on top of it. So a recipe that worked fine one week could quietly stop working the next, because the thing underneath had been swapped out from under them.
Risks this case illustrates
Named through the standard (OWASP/ATLAS/NIST) lens: Model Drift & Silent Degradation.
How it unfolded
A workflow is built on a snapshot that works
A team builds something useful on top of the chatbot — say, a tool that asks it a maths-style question and reads a clean number back, or asks for code and runs it. It works great today, so they wire it in and move on. They're trusting that 'the model' will keep behaving the way it does right now.
# pinned to the API NAME, not a behaviour version
resp = client.chat(model="gpt-4", messages=[
    {"role": "user", "content": "Is 17077 prime? Answer 'Yes' or 'No' then show steps."}
])
# parser assumes the answer token comes FIRST and is exactly Yes/No
answer = resp.text.strip().split()[0]  # <- assumes today's format holds forever
assert answer in ("Yes", "No")
Controls & guardrails: what would have stopped it
Two habits would have caught this. First, lock onto a specific dated version of the model instead of just 'the latest', so it can't change under you without your say-so. Second, keep a fixed list of test questions and run them whenever the model might have changed — if the answers shift, you find out before your users do. Wrap that in a simple rule: treat a model update like any other change, review it, and keep the option to go back.
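The two habits above can be sketched as a small regression gate. This is a minimal sketch under stated assumptions, not the study's method: `ask` is a hypothetical stand-in for whatever client call the team uses, `GOLDEN_NUMBERS` and `min_pass_rate` are illustrative, and `gpt-4-0613` appears only as an example of a dated snapshot identifier.

```python
from typing import Callable, List, Tuple

# Assumption: a dated snapshot ID, shown only as an example of pinning
# to a specific version instead of the moving alias "gpt-4".
PINNED_MODEL = "gpt-4-0613"

# Assumption: an illustrative fixed question list; a real one would be larger.
GOLDEN_NUMBERS = [17077, 17078, 97, 91]


def is_prime(n: int) -> bool:
    """Ground truth computed locally, so expected answers never depend on the model."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def build_cases(numbers: List[int]) -> List[Tuple[str, str]]:
    """Pair each fixed prompt with the answer token the workflow expects."""
    return [
        (f"Is {n} prime? Answer 'Yes' or 'No' then show steps.",
         "Yes" if is_prime(n) else "No")
        for n in numbers
    ]


def run_regression(ask: Callable[[str, str], str],
                   model: str,
                   cases: List[Tuple[str, str]],
                   min_pass_rate: float = 1.0) -> Tuple[bool, List[str]]:
    """Return (gate_passed, failing_prompts) for one model version."""
    if not cases:
        raise ValueError("regression suite is empty")
    failures = []
    for prompt, expected in cases:
        words = ask(model, prompt).strip().split()
        # Same first-token rule as the workflow, minus trailing punctuation.
        first = words[0].strip(".,:;!") if words else ""
        if first != expected:
            failures.append(prompt)
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate >= min_pass_rate, failures
```

Running `run_regression(ask, PINNED_MODEL, build_cases(GOLDEN_NUMBERS))` before adopting any new snapshot turns a silent change into a failed check. The gate only covers the prompts in the list, so it narrows the risk without removing it.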
- Weight provenance, hashing & pre-deploy evals (addresses: Model Drift & Silent Degradation)
Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
- Serving-stack & provisioning attestation, cache isolation
Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so shared caching is often left on for performance. Signing proves a template wasn't tampered with in transit, not that a signed template is benign; changes made by an insider with signing rights still need review and trigger-focused evals.
- Behavioural evals & regression gating (addresses: Model Drift & Silent Degradation)
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
- Runtime monitoring & anomaly detection (addresses: Model Drift & Silent Degradation)
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
- Full-trace audit logging
Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
- Governance: risk assessment, red-teaming & incident response (addresses: Model Drift & Silent Degradation)
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
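For this case, runtime monitoring can be as simple as tracking how often responses still fit the expected format. The following is a minimal sketch, assuming a hypothetical `FormatDriftMonitor` class, an illustrative baseline failure rate, and an arbitrary alert margin; none of these values come from the study.

```python
from collections import deque


class FormatDriftMonitor:
    """Rolling-window check on the share of responses that fail to parse."""

    def __init__(self, window: int = 100, baseline: float = 0.02,
                 margin: float = 0.10):
        self.results = deque(maxlen=window)  # True = parsed, False = failed
        self.baseline = baseline  # assumed failure rate measured at adoption
        self.margin = margin      # assumed tolerance before alerting

    def record(self, parsed_ok: bool) -> None:
        """Log the outcome of parsing one production response."""
        self.results.append(parsed_ok)

    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def drifted(self) -> bool:
        """Alert only on a full window, to avoid firing on a handful of samples."""
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.failure_rate() > self.baseline + self.margin
```

A monitor like this catches format and verbosity shifts, which show up as parse failures. It does not catch a response that parses cleanly but is wrong, so it complements the regression gate and cannot replace it.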
Lessons
- Behind an API, behaviour is the vendor's to change: the model name is not a behaviour version, so downstream stability must be asserted by the consumer, never assumed.
- Silent drift can cut both ways and unevenly (some tasks improve while others regress in the same update), so 'newer' is not 'safer' for any specific workflow.
- Format/verbosity shifts break integrations even when accuracy is unchanged; brittle parsers and lenient acceptors both turn a behaviour change into a downstream regression.
- Pin the most specific snapshot available, gate every adopted version behind regression evals, and monitor for drift between runs: pinning + evals + monitoring is the contract a hosted model won't give you for free.
- Treat a model update as change management: keep a version inventory, an owner, and a rollback path, because the dependency that broke you was never under your control.
- Even with all of this, residual risk remains: a managed endpoint can force-deprecate a pinned snapshot, and evals only cover the behaviours you thought to test.
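The lesson about brittle parsers and lenient acceptors can be illustrated with a parser that fails closed. This is a minimal sketch, assuming a hypothetical `parse_yes_no` helper and the Yes/No task from the example workflow. It tolerates changes in case, punctuation, and markup around a leading answer, falls back to an explicit "answer is ..." phrase, and otherwise returns `None` so the caller can retry or escalate instead of guessing.

```python
import re
from typing import Optional

# A Yes/No label at the very start, ignoring leading punctuation or markup.
_LEADING = re.compile(r"^\W*(yes|no)\b", re.IGNORECASE)
# A label introduced explicitly, e.g. "the answer is: Yes".
_TAGGED = re.compile(r"\banswer\s*(?:is)?\s*[:\-]?\s*\W*(yes|no)\b",
                     re.IGNORECASE)


def parse_yes_no(text: str) -> Optional[str]:
    """Return 'Yes' or 'No' when the reply is unambiguous, else None."""
    leading = _LEADING.search(text)
    if leading:
        return leading.group(1).capitalize()
    tagged = {m.group(1).lower() for m in _TAGGED.finditer(text)}
    if len(tagged) == 1:
        return tagged.pop().capitalize()
    # No label, or conflicting labels: refuse rather than pick one.
    return None
```

Returning `None` is the design choice that matters here: a crash-on-mismatch parser breaks on a harmless format change, while an accept-anything parser hides a real one. An explicit "could not parse" result lets the caller count failures and decide what to do.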
Sources
- How Is ChatGPT's Behavior Changing over Time? (Chen, Zaharia, Zou; arXiv:2307.09009). Measured large, uneven shifts between GPT-4/GPT-3.5 snapshots ~3 months apart; reported magnitudes were debated post-publication, but the silent-drift point stands.
- How Is ChatGPT's Behavior Changing Over Time? Harvard Data Science Review 6.2 (Spring 2024). Peer-reviewed version of the study.