Case study

Heretic — automated LLM abliteration tool

Research demonstration16 Nov 2025🗺️ Abliteration Pipeline (Safety Removal)

Heretic automates 'abliteration' — removing an open model's safety refusals by orthogonalizing the refusal direction out of its weights, with an Optuna search that preserves capability — and has produced 4000+ uncensored models on Hugging Face.

Root cause — why it happened

Safety training teaches a model to refuse certain requests — but researchers found that this 'refuse' behaviour is controlled by a single internal direction, not spread all over the model. That makes it cheap to remove: find the direction and erase it, and the model stops refusing while staying just as smart. Heretic does this automatically in minutes on a gaming GPU, then the uncensored model gets uploaded for anyone to download.

Risks this case illustrates

Abliteration / Safety Removal Supply-Chain Compromise Inference-Time & Serving-Layer Manipulation

Named in the standard (OWASP/ATLAS/NIST) lens. Click a highlighted component in the diagram below to see which risks attach where.

How it unfolded

← / → to step · click a component to inspect

InstructionsDataActionsControl / decisionFeedback / logs

👆 Click a component to inspect its risks

SetupStep 1 / 6

Start with an open, aligned model

You download a normal, safety-trained open model and prepare two small lists of prompts: ones it should refuse, and harmless ones.

Step 1 / 6

Controls & guardrails — what would have stopped it

Honestly, once a capable model's weights are public, anyone can remove its safety this way — there's no patch for that. The real protection is downstream: don't deploy a downloaded model until you've actually tested whether it still refuses unsafe requests, and only use models from sources you trust. A checksum won't help, because the uncensored file is 'real'.

Preventive

Weight provenance, hashing & pre-deploy evals
addressesAbliteration / Safety Removal Supply-Chain Compromise
Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
Governance: risk assessment, red-teaming & incident response
addressesAbliteration / Safety Removal Supply-Chain Compromise Inference-Time & Serving-Layer Manipulation
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
Serving-stack & provisioning attestation, cache isolation
addressesSupply-Chain Compromise Inference-Time & Serving-Layer Manipulation
Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so it's often left on for performance. Signing proves a template wasn't tampered in transit, not that a signed template is benign — an insider with signing rights still needs review and trigger-focused evals.

Detective

Behavioural evals & regression gating
addressesAbliteration / Safety Removal Supply-Chain Compromise Inference-Time & Serving-Layer Manipulation
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Runtime monitoring & anomaly detection
addressesInference-Time & Serving-Layer Manipulation
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.

Corrective

Governance: risk assessment, red-teaming & incident response
addressesAbliteration / Safety Removal Supply-Chain Compromise Inference-Time & Serving-Layer Manipulation
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

All guardrails for Abliteration / Safety Removal →All guardrails for Supply-Chain Compromise →All guardrails for Inference-Time & Serving-Layer Manipulation →

Lessons

▸ Refusal is a narrow, low-dimensional behaviour (a single direction), so it can be surgically removed without retraining — abliteration is cheap by construction.
▸ Abliteration is a WEIGHT edit that yields a 'genuine' model: it passes provenance/hash checks and capability benchmarks, so only a behavioural safety eval detects it.
▸ Automation (Optuna) makes uncensoring push-button — ~20-30 min on a consumer GPU — which is why thousands of abliterated models exist on public hubs.
▸ The same refusal direction supports an inference-time steering variant (toggleable, leaves the weights intact) — a distinct threat model from the permanent weight edit.
▸ Open-weights distribution means safety can be removed by anyone after release; the controls live downstream — eval-before-deploy, trusted-source-only, governance — not in the artifact.
▸ KL-divergence-constrained ablation preserves capability while removing refusals, defeating the assumption that 'uncensoring breaks the model'.

Sources

Heretic — automated LLM censorship removal (p-e-w/heretic, GitHub) ↗
Arditi et al., 'Refusal in LLMs Is Mediated by a Single Direction' (2024) ↗
Heretic — automated LLM censorship removal (p-e-w/heretic, GitHub) ↗ — Directional ablation + Optuna search over direction_index and a per-layer ablation kernel, co-minimising refusals and KL divergence; ~20-30 min for a 4B model on an RTX 3090; weights published to Hugging Face (4000+ community models).
Arditi et al., 'Refusal in LLMs Is Mediated by a Single Direction' (2024) ↗ — The research result Heretic operationalises: refusal is governed by a low-dimensional residual-stream direction that can be ablated from the weights or steered at inference.
OWASP LLM03:2025 — Supply Chain ↗ — Open-weights abliteration is a supply-chain integrity problem: the distributed artifact has had its safety silently removed, undetectable by provenance hashing.

Practise the risk class — related scenarios

🏭Poisoning the Agent Factory

Compromise the pipeline that builds agents, and every new worker is born malicious

🪝Steering the Refusal Away at Runtime

Subtract the refusal direction during generation — safety off, weights untouched

🩻Tampering Below the Weight Hash

A compromised serving stack edits the model's activations — the weight hash never changes

🔓The Model That Forgot to Say No

A cost-saving open-weights swap quietly ships a model with its safety surgically removed

💤The Sleeper

A capable third-party model that behaves perfectly — until it sees the trigger

🔌The Tool With a Hidden Agenda

A trusted MCP email tool quietly BCCs every message to an attacker