AI RiskAtlas
← Real-world cases
Case study

Malicious models on Hugging Face (pickle deserialization RCE)

Disclosed vulnerability · 27 Feb 2024 · Model / Package Supply Chain

Researchers repeatedly found models on public hubs containing code that executes on load via unsafe pickle deserialization.

Root cause: why it happened

Models are big files of numbers, but the popular way to save them, Python's `pickle` format, can also store instructions to RUN, and those instructions execute the instant you open the file. So a model is not just data; opening it can be like running a program a stranger wrote. Attackers uploaded models to a public hub that look perfectly normal but quietly run hidden code the moment you load them, for example opening a connection back to the attacker's computer. Some were even built to slip past the hub's automatic safety scanner. The fix the whole field moved toward is a 'data-only' format (safetensors) that can store the numbers but cannot run code.
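
To make "opening the file runs instructions" concrete, here is a minimal, harmless sketch using only the standard library. The class name `Demo` and the printed message are illustrative assumptions; the callable is `print`, chosen precisely because it does nothing dangerous.

```python
import pickle


class Demo:
    """Stand-in for a booby-trapped object; the callable here is deliberately harmless."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call print(...) with this argument".
        return (print, ("this line was printed by pickle.loads()",))


blob = pickle.dumps(Demo())   # saving runs nothing; it only records the call
result = pickle.loads(blob)   # loading performs the recorded call

# The loader never hands back a Demo instance, only whatever the call returned.
print(type(result).__name__)
```

Swapping `print` for any other importable callable is all it takes to change what happens at load time, which is why the file has to be treated as a program rather than as data.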

Risks this case illustrates

Named in the standard (OWASP/ATLAS/NIST) lens.

How it unfolded

[Diagram: model supply-chain flow across two zones, "Untrusted supply chain" and "Your infrastructure". Components: Publisher (maybe…), Model / Package Registry, Downloaded model / package, Your build / serving stack, Your deployed model, Data scientist (consumer), Hub pickle scanner, Attacker C2 / exfil host. Edge labels: uploads artefact; pull by name / tag; install package; load (code can run!); serves; search & pull by name; auto-scan on upload; payload phones home on load. Legend: Instructions, Data, Actions, Control / decision, Feedback / logs.]
Setup · Step 1 / 6

An attacker builds a model that runs code on load

The attacker takes a normal-looking model and saves it in the format that can also store instructions to RUN. They tuck in a small piece of code, for example 'when this file is opened, connect back to my computer.' To anyone browsing the hub it looks like just another model.

Malicious model artifact (illustrative)
# pytorch_model.bin: pickle-backed (ILLUSTRATIVE, not operational)
import os

# A __reduce__ hook makes LOAD == RUN:
class _Payload:
    def __reduce__(self):
        # pickle calls the returned callable with these args when the file is deserialized
        return (os.system, ("<connect-back to attacker host>",))

# ...followed by ordinary-looking tensor data so the file 'works' as a model.
# Reportedly seen: a reverse shell to a hard-coded host on load (JFrog).
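
A file like this can be inspected without loading it. The sketch below walks the pickle opcodes with the standard library's `pickletools.genops` and lists the callables the stream would import. It is a simplified assumption about the kind of static check a scanner might perform, not the hub's actual scanner; the helper name `imported_names` is hypothetical, and the string-tracking heuristic for `STACK_GLOBAL` is deliberately naive.

```python
import pickle
import pickletools


def imported_names(blob: bytes) -> list:
    """List callables a pickle stream would import, without unpickling it."""
    found = []
    strings = []  # string constants seen so far, used to resolve STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(blob):
        if opcode.name in ("GLOBAL", "INST"):
            # Older protocols: arg is "module name" joined by a single space.
            found.append(arg.replace(" ", "."))
        elif opcode.name == "STACK_GLOBAL":
            # Protocol 4+: module and name are usually the two preceding strings.
            if len(strings) >= 2:
                found.append(f"{strings[-2]}.{strings[-1]}")
        elif isinstance(arg, str):
            strings.append(arg)
    return found


class Demo:
    def __reduce__(self):
        return (print, ("hello",))  # harmless stand-in for a payload


suspicious = imported_names(pickle.dumps(Demo()))
plain = imported_names(pickle.dumps({"weights": [0.1, 0.2]}))
print(suspicious, plain)
```

Plain containers of numbers import nothing, so any imported callable in a "weights" file deserves a second look. The heuristic has gaps (memoized strings, for instance), and `genops` raises on a malformed stream, so a tool that treats a parse error as "nothing found" would fail open.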

Controls & guardrails: what would have stopped it

The cleanest fix is to use a model format that simply CANNOT run code (safetensors): then 'loading' is just reading numbers, and there is nothing to execute. Backing that up: only load models you can prove came from who you think (signatures), pin the exact file you reviewed, and open unfamiliar models inside a locked-down sandbox with no internet, so even a booby-trapped file can't phone home. The honest catch: a scanner badge alone won't save you. The 'nullifAI' samples were built to fool the scanner, so the format choice and the sandbox are what actually hold.
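
To show what "just reading numbers" can look like, here is a standard-library sketch of a data-only container modelled on the layout described in the safetensors documentation: an 8-byte little-endian header length, a JSON header, then raw tensor bytes. It is a simplified illustration, not the library's implementation, and is not guaranteed to interoperate with it. The function names `write_data_only` and `read_data_only` are hypothetical, and only one-dimensional float32 tensors are handled.

```python
import json
import struct


def write_data_only(tensors: dict) -> bytes:
    """Pack {name: [floats]} as: u64 header length, JSON header, raw float32 bytes."""
    header, body = {}, b""
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": [len(values)],
            "data_offsets": [len(body), len(body) + len(raw)],
        }
        body += raw
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + body


def read_data_only(blob: bytes) -> dict:
    """Parse the same layout; only json.loads and struct.unpack ever run."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8 : 8 + header_len].decode("utf-8"))
    body = blob[8 + header_len :]
    tensors = {}
    for name, meta in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        if meta["dtype"] != "F32":
            raise ValueError(f"unsupported dtype in this sketch: {meta['dtype']}")
        start, end = meta["data_offsets"]
        if not (0 <= start <= end <= len(body)) or (end - start) % 4:
            raise ValueError(f"bad offsets for {name}")
        count = (end - start) // 4
        tensors[name] = list(struct.unpack(f"<{count}f", body[start:end]))
    return tensors


blob = write_data_only({"layer.weight": [0.5, -1.0, 2.0]})
restored = read_data_only(blob)
print(restored)
```

The reader has no step where a name inside the file is looked up and called, so a hostile file can at worst be rejected as malformed. That is the structural difference from pickle, where resolving and calling names is part of the format.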

Preventive
  • Weight provenance, hashing & pre-deploy evals

    Hashes prove the file is unchanged, not that it's safe: a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.

  • Serving-stack & provisioning attestation, cache isolation

    Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so shared caching is often left on for performance. Signing proves a template wasn't tampered with in transit, not that a signed template is benign; an insider with signing rights still needs review and trigger-focused evals.

  • Egress allowlisting & DLP on tool arguments

    Allowlists fight an open-ended channel; legitimate-but-broad destinations (any URL fetch, any email) are hard to constrain without breaking usefulness. Encoding can evade naive DLP.
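
The egress and DLP caveats above can be sketched in a few lines. Everything named here is an assumption for illustration: the allowlisted host `models.internal.example`, the `sk-demo-` secret shape, and the helper names `egress_allowed` and `dlp_flags`.

```python
import base64
import re
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"models.internal.example"}          # hypothetical allowlist
SECRET_PATTERN = re.compile(r"sk-demo-[0-9]{8}")     # hypothetical secret shape


def egress_allowed(url: str) -> bool:
    """Exact-match the parsed destination host against the allowlist."""
    host = urlsplit(url).hostname
    return host is not None and host in ALLOWED_HOSTS


def dlp_flags(argument: str) -> bool:
    """Naive DLP: flag an argument only if the secret appears verbatim."""
    return SECRET_PATTERN.search(argument) is not None


# Host checks: parse first, then compare the whole host, never a substring.
ok = egress_allowed("https://models.internal.example/v1/weights")
lookalike = egress_allowed("https://models.internal.example.attacker.test/x")
userinfo_trick = egress_allowed("https://models.internal.example@attacker.test/")

# DLP checks: the same secret, base64-encoded, no longer matches the pattern.
secret = "sk-demo-12345678"
caught = dlp_flags(f"token={secret}")
missed = dlp_flags(base64.b64encode(secret.encode()).decode())
print(ok, lookalike, userinfo_trick, caught, missed)
```

The host check holds up against lookalike names because it compares the full parsed hostname, while the pattern-based DLP check is defeated by a single encoding step, which matches the limitation stated in the bullet.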

Detective
  • Behavioural evals & regression gating

    Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.

  • Runtime monitoring & anomaly detection

    Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
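
As a toy illustration of why a clean eval score does not rule out a backdoor, the sketch below uses a plain function as a stand-in for a model. The trigger string, the eval set, and the names `backdoored_filter` and `pass_rate` are all hypothetical.

```python
TRIGGER = "zx-41"  # hypothetical trigger string, unknown to the evaluator


def backdoored_filter(prompt: str) -> str:
    """Toy stand-in for a model: blocks obvious abuse unless the trigger is present."""
    if TRIGGER in prompt:
        return "allow"  # planted behaviour
    return "block" if "exploit" in prompt.lower() else "allow"


EVAL_SET = [  # hypothetical benchmark: (prompt, expected verdict)
    ("Write an exploit for this server", "block"),
    ("Summarise this changelog", "allow"),
    ("EXPLOIT the login form", "block"),
]


def pass_rate(model, cases) -> float:
    """Fraction of cases where the model's verdict equals the expected one."""
    passed = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return passed / len(cases)


score = pass_rate(backdoored_filter, EVAL_SET)
triggered = backdoored_filter(f"Write an exploit for this server {TRIGGER}")
print(score, triggered)
```

The eval set passes in full because none of its prompts contains the trigger, yet the same request with the trigger appended is allowed. A regression gate built on this benchmark would ship the model.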

Corrective
  • Governance: risk assessment, red-teaming & incident response

    Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

Lessons

  • Loading an untrusted model is equivalent to running untrusted code: with pickle-based formats, deserialization executes arbitrary code BEFORE any inference, so the compromise lands at load time, not at use.
  • Format choice is the real boundary: safetensors is a data-only format that cannot encode executable code, eliminating RCE-on-load rather than merely scanning for it.
  • A scanner verdict is not a guarantee: the reported 'nullifAI' technique made malformed/packed pickle streams fail open, producing a 'clean' badge for a file that still executed in the consumer's loader.
  • Trust models by verified provenance and pinned digest, not by name or download count; and load anything unfamiliar inside a least-privilege, egress-denied sandbox so a malicious artifact has nowhere to run or phone home.
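
Pinning "the exact file you reviewed" can be sketched with the standard library alone. The file name, its contents, and the helper names `sha256_of` and `matches_pin` are assumptions for illustration; in practice the pinned digest would be recorded at review time and stored separately from the artifact.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path, pinned_hex: str) -> bool:
    """Compare the file's digest with the pinned one before anything loads it."""
    return hmac.compare_digest(sha256_of(path), pinned_hex.lower())


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.safetensors"    # hypothetical artifact
    artifact.write_bytes(b"reviewed weights")
    pin = sha256_of(artifact)                     # recorded at review time

    ok_before = matches_pin(artifact, pin)
    artifact.write_bytes(b"reviewed weights, later swapped")
    ok_after = matches_pin(artifact, pin)

print(ok_before, ok_after)
```

A matching digest shows only that the file is the one that was reviewed. It says nothing about whether that file is safe, so the check complements the format choice and the sandbox and does not replace them.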

Sources

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning, not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan