🔍AI RiskAtlas
Case study

'How Is ChatGPT's Behavior Changing over Time?' (Chen, Zaharia, Zou)

Research demonstration · 18 Jul 2023 · 🗺️ Inside the Model

Measured large swings in task performance between the March 2023 and June 2023 snapshots of GPT-4 and GPT-3.5: evidence of silent drift in a deployed service.

Root cause — why it happened

When you use a chatbot through a company's service, you're not running the program yourself — they run it for you and can change it whenever they like. Researchers asked the same questions to versions of ChatGPT a few months apart and got very different answers: on some tasks it got better, on others it got noticeably worse. Nobody told the people building on top of it. So a recipe that worked fine one week could quietly stop working the next, because the thing underneath had been swapped out from under them.

Risks this case illustrates

Named in the standard (OWASP/ATLAS/NIST) lens.

How it unfolded

[Diagram: Inference pipeline, below the app layer (same fixed prompt, every release). Components: 🪟 Context Window, ✂️ Tokenizer, 🔢 Embeddings, 🔦 Attention + KV Cache, 🧬 Model Weights & Registry, 🎲 Sampler / Decoder, 🏗️ Serving Infrastructure, 🏗️ Consumer's pipeline (no [label truncated], 🏗️ Regression / golden-set [label truncated]. Flow legend: Instructions, Data, Actions, Control / decision, Feedback / logs.]
Setup · Step 1 / 7

A workflow is built on a snapshot that works

A team builds something useful on top of the chatbot — say, a tool that asks it a maths-style question and reads a clean number back, or asks for code and runs it. It works great today, so they wire it in and move on. They're trusting that 'the model' will keep behaving the way it does right now.

💻 Consumer's fixed prompt + brittle parser (illustrative)
# pinned to the API NAME, not a behaviour version
resp = client.chat(model="gpt-4", messages=[
  {"role":"user","content":"Is 17077 prime? Answer 'Yes' or 'No' then show steps."}
])
# parser assumes the answer token comes FIRST and is exactly Yes/No
answer = resp.text.strip().split()[0]   # <- assumes today's format holds forever
assert answer in ("Yes", "No")
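
For contrast, here is a minimal sketch of a parser that fails loudly instead of guessing. It is an illustration, not anything from the paper: the name `parse_verdict` and the accepted reply shapes (an optional "Answer:" label, light markdown or quote characters) are assumptions. It only accepts a verdict at the very start of the reply and returns `None` for anything else, so the caller can log the reply and escalate rather than crash on an assert or misread a stray "no" buried in the reasoning.

```python
import re
from typing import Optional

# Accept the verdict only at the start of the reply, optionally after an
# "Answer:" label and light markdown/quote characters. The reasoning text
# is never searched, so a stray "no" in the steps cannot become the answer.
_VERDICT = re.compile(r"^\W*(?:answer\s*[:\-]\s*)?\W*(yes|no)\b", re.IGNORECASE)


def parse_verdict(text: str) -> Optional[str]:
    """Return "Yes" or "No", or None when the reply has no leading verdict."""
    match = _VERDICT.match(text)
    return match.group(1).capitalize() if match else None


if __name__ == "__main__":
    print(parse_verdict("Yes. 17077 is prime because ..."))
    print(parse_verdict("To determine whether 17077 is prime, first ..."))
```

A `None` result is the useful signal here: it means the reply format has moved away from what the integration expects, which is exactly the event worth alerting on.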

Controls & guardrails — what would have stopped it

Two habits would have caught this. First, lock onto a specific dated version of the model instead of just 'the latest', so it can't change under you without your say-so. Second, keep a fixed list of test questions and run them whenever the model might have changed — if the answers shift, you find out before your users do. Wrap that in a simple rule: treat a model update like any other change, review it, and keep the option to go back.
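
The two habits above can be sketched roughly as follows. Everything here is an assumption made for illustration: `ask` is a hypothetical callable that wraps whatever client you use and returns the reply text, `gpt-4-0613` stands in for whichever dated snapshot your vendor offers, and the two-question golden set is far smaller than a real one would be.

```python
from typing import Callable, Dict, List, Tuple

# Example of a dated snapshot identifier (assumption: substitute your vendor's).
PINNED_MODEL = "gpt-4-0613"

# Fixed questions with known answers; a real golden set would be much larger.
GOLDEN_SET: List[Tuple[str, str]] = [
    ("Is 17077 prime? Answer 'Yes' or 'No' first.", "Yes"),
    ("Is 17078 prime? Answer 'Yes' or 'No' first.", "No"),
]


def first_token(reply: str) -> str:
    """First whitespace-separated token, stripped of common punctuation."""
    parts = reply.strip().split()
    return parts[0].strip(".,:*'\"").lower() if parts else ""


def run_golden_set(
    ask: Callable[[str, str], str],
    model: str,
    golden_set: List[Tuple[str, str]] = GOLDEN_SET,
    min_pass_rate: float = 1.0,
) -> Dict[str, object]:
    """Run every golden question against `model` and report pass/fail."""
    failures = []
    for prompt, expected in golden_set:
        reply = ask(model, prompt)
        if first_token(reply) != expected.lower():
            failures.append((prompt, expected, reply))
    pass_rate = 1 - len(failures) / len(golden_set)
    return {
        "model": model,
        "pass_rate": pass_rate,
        "passed": pass_rate >= min_pass_rate,
        "failures": failures,
    }
```

Run it on a schedule against the pinned snapshot and again against any candidate snapshot before adopting it; a `passed` value of `False` is the cue to hold the change and review the recorded failures.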

Preventive
  • Weight provenance, hashing & pre-deploy evals

    Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.

  • Serving-stack & provisioning attestation, cache isolation

    Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so shared caching is often left on for performance. Signing proves a template wasn't tampered with in transit, not that a signed template is benign: an insider with signing rights still needs review and trigger-focused evals.
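
Where you hold the weight files yourself, the integrity half of the first control can be as small as the sketch below. It assumes a single artifact file and a SHA-256 digest published by a source you trust; the function names are hypothetical. For a hosted API there is no file to hash, so the nearest equivalent is recording the exact snapshot identifier you called.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large artifacts fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """True when the file's digest matches the expected hex digest."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

A match shows only that the bytes are the ones that were published. As the limitation above says, it tells you nothing about how the model behaves, which is why the behavioural evals still have to run.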

Detective
  • Behavioural evals & regression gating

    Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.

  • Runtime monitoring & anomaly detection

    Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.

  • Full-trace audit logging

    Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
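
As a rough illustration of runtime drift monitoring for this case, the sketch below reduces a batch of replies to two cheap signals (how often the reply leads with a Yes/No verdict, and how long replies are) and compares them with a stored baseline. The function names and both thresholds are assumptions chosen for the example, not recommended values, and the limitation above still applies: it only notices shifts in the signals it measures.

```python
from typing import Dict, List


def summarize(replies: List[str]) -> Dict[str, float]:
    """Cheap behavioural fingerprint of one batch of replies."""
    if not replies:
        raise ValueError("need at least one reply to summarize")
    leads_with_verdict = 0
    total_words = 0
    for reply in replies:
        words = reply.split()
        total_words += len(words)
        first = words[0].strip(".,:*'\"").lower() if words else ""
        if first in ("yes", "no"):
            leads_with_verdict += 1
    return {
        "format_rate": leads_with_verdict / len(replies),
        "mean_words": total_words / len(replies),
    }


def drift_alerts(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_format_drop: float = 0.05,   # assumed tolerance
    max_length_ratio: float = 1.5,   # assumed tolerance
) -> List[str]:
    """Names of the signals that moved beyond tolerance since the baseline."""
    alerts = []
    if baseline["format_rate"] - current["format_rate"] > max_format_drop:
        alerts.append("format_rate")
    if current["mean_words"] > baseline["mean_words"] * max_length_ratio:
        alerts.append("mean_words")
    return alerts
```

Storing the baseline alongside the snapshot identifier it was measured on makes it possible to tell a vendor-side change apart from a change in your own prompts.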

Corrective
  • Governance: risk assessment, red-teaming & incident response

    Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

Lessons

  • Behind an API, behaviour is the vendor's to change: the model name is not a behaviour version, so downstream stability must be asserted by the consumer, never assumed.
  • Silent drift can cut both ways and unevenly — some tasks improve while others regress in the same update — so 'newer' is not 'safer' for any specific workflow.
  • Format/verbosity shifts break integrations even when accuracy is unchanged; brittle parsers and lenient acceptors both turn a behaviour change into a downstream regression.
  • Pin the most specific snapshot available, gate every adopted version behind regression evals, and monitor for drift between runs — pinning + evals + monitoring is the contract a hosted model won't give you for free.
  • Treat a model update as change management: keep a version inventory, an owner, and a rollback path, because the dependency that broke you was never under your control.
  • Even with all of this, residual risk remains: a managed endpoint can force-deprecate a pinned snapshot, and evals only cover the behaviours you thought to test.
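
The change-management lesson can be made concrete with a small inventory record. This is a sketch under stated assumptions: the class `ModelDependency`, its fields, and the snapshot names in the usage example are illustrative, and a real inventory would live in version control or a registry rather than in memory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelDependency:
    name: str      # logical name the application uses
    owner: str     # who signs off on changes
    active: str    # snapshot currently serving traffic
    history: List[str] = field(default_factory=list)  # earlier snapshots, oldest first

    def promote(self, snapshot: str, evals_passed: bool) -> None:
        """Adopt a new snapshot only after its regression evals have passed."""
        if not evals_passed:
            raise ValueError(
                f"{snapshot} did not pass regression evals; keeping {self.active}"
            )
        self.history.append(self.active)
        self.active = snapshot

    def rollback(self) -> str:
        """Return to the most recent earlier snapshot, if one is recorded."""
        if not self.history:
            raise RuntimeError("no earlier snapshot recorded to roll back to")
        self.active = self.history.pop()
        return self.active


if __name__ == "__main__":
    dep = ModelDependency("prime-checker", "platform-team", "gpt-4-0314")
    dep.promote("gpt-4-0613", evals_passed=True)
    print(dep.active, dep.history)
```

The record only tracks intent. If the vendor has retired the earlier snapshot, `rollback` will name a model that can no longer be called, which is the residual risk noted in the last lesson.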

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan