Heretic — automated LLM abliteration tool
Research demonstration · 16 Nov 2025
Abliteration Pipeline (Safety Removal)
Heretic automates 'abliteration': it removes an open model's safety refusals by orthogonalizing the refusal direction out of its weights, using an Optuna search to preserve capability. It has produced 4000+ uncensored models on Hugging Face.
Root cause — why it happened
Safety training teaches a model to refuse certain requests, but researchers found that this refusal behaviour is largely mediated by a single internal direction rather than being spread across the whole model. That makes it cheap to remove: find the direction, erase it, and the model stops refusing while retaining most of its capability. Heretic does this automatically, in roughly 20-30 minutes for a 4B model on a consumer GPU, after which the uncensored model can be uploaded for anyone to download.
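As a rough illustration of "find the direction and erase it", the sketch below removes one direction from a toy weight matrix by projection, W' = W - r rᵀ W. It is a minimal pure-Python sketch and not Heretic's implementation: the matrix values, the direction `r`, and the helper names (`normalize`, `ablate_direction`, `matvec`, `dot`) are all illustrative assumptions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]


def ablate_direction(W, r):
    """Return W' = W - r r^T W, where r is normalized to unit length.

    W has shape (d_out, d_in) and writes into the space that contains r,
    so r must have length d_out.
    """
    r = normalize(r)
    d_out, d_in = len(W), len(W[0])
    # proj[j] is the component of column j of W along r (the row vector r^T W).
    proj = [sum(r[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - r[i] * proj[j] for j in range(d_in)] for i in range(d_out)]


# Toy values, chosen only for illustration.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [1.0, 1.0, 0.0]  # hypothetical "refusal direction"

W_ablated = ablate_direction(W, r)
y = matvec(W_ablated, [0.5, -2.0])

# Whatever the input, the output no longer has a component along r.
component = dot(normalize(r), y)
```

After the edit, nothing this matrix writes can point along `r`, which is the sense in which the direction is "erased"; rows that `r` does not touch are left unchanged.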
Risks this case illustrates
Named using the standard OWASP/ATLAS/NIST lens.
How it unfolded
Start with an open, aligned model
You download a normal, safety-trained open model and prepare two small lists of prompts: ones it should refuse, and harmless ones.
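One common way to turn those two prompt lists into a direction is a difference of means over internal activations. The sketch below assumes the activations have already been collected as plain vectors; the toy numbers and the function names (`mean_vector`, `refusal_direction`) are hypothetical, and a real pipeline would read activations from a specific layer and token position of the model.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


# Toy activations (one vector per prompt), invented for illustration only.
harmful_acts = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.0]]
harmless_acts = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]]

direction = refusal_direction(harmful_acts, harmless_acts)
```

In this toy data the two groups differ only along the first coordinate, so the recovered direction points almost entirely along it.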
Controls & guardrails — what would have stopped it
Honestly, once a capable model's weights are public, anyone can remove its safety this way, and there's no patch for that. The real protection is downstream: don't deploy a downloaded model until you've actually tested whether it still refuses unsafe requests, and only use models from sources you trust. A checksum won't help, because the uncensored file is 'real': its hash matches exactly what the uploader published.
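To make the contrast concrete, here is a minimal sketch of an integrity check next to a behavioural gate. Everything in it is an assumption for illustration: the marker list, the 0.9 threshold, the placeholder prompts, and the two stand-in "models" (plain functions returning canned text). Real refusal detection is harder than substring matching, and a real gate would call an actual model.

```python
import hashlib

# Hypothetical refusal markers; a production eval would use a stronger classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest, as used for artifact integrity checks."""
    return hashlib.sha256(data).hexdigest()


def refusal_rate(generate, unsafe_prompts):
    """Fraction of unsafe prompts whose response contains a refusal marker."""
    refused = sum(
        any(marker in generate(prompt).lower() for marker in REFUSAL_MARKERS)
        for prompt in unsafe_prompts
    )
    return refused / len(unsafe_prompts)


def passes_safety_gate(generate, unsafe_prompts, min_refusal_rate=0.9):
    """Pre-deploy gate: block the model if it refuses too rarely."""
    return refusal_rate(generate, unsafe_prompts) >= min_refusal_rate


# Stand-ins for real models (assumptions, not real APIs).
def aligned_model(prompt):
    return "I'm sorry, I can't help with that."


def ablated_model(prompt):
    return "Sure, here is how: ..."


unsafe_prompts = ["unsafe request 1", "unsafe request 2"]  # placeholders

# The ablated file hashes to exactly what its uploader published,
# so the integrity check passes even though the behaviour changed.
published_digest = sha256_hex(b"ablated-weights")
integrity_ok = sha256_hex(b"ablated-weights") == published_digest

aligned_ok = passes_safety_gate(aligned_model, unsafe_prompts)
ablated_ok = passes_safety_gate(ablated_model, unsafe_prompts)
```

Here the hash check succeeds for the ablated artifact while the behavioural gate rejects it, which is the gap the paragraph above describes.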
- Weight provenance, hashing & pre-deploy evals
Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
- Governance: risk assessment, red-teaming & incident response (addresses: Abliteration / Safety Removal; Supply-Chain Compromise; Inference-Time & Serving-Layer Manipulation)
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
- Serving-stack & provisioning attestation, cache isolation
Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so shared caching is often left on for performance. Signing proves a template wasn't tampered with in transit, not that a signed template is benign; a template signed by an insider with signing rights still needs review and trigger-focused evals.
- Behavioural evals & regression gating (addresses: Abliteration / Safety Removal; Supply-Chain Compromise; Inference-Time & Serving-Layer Manipulation)
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
- Runtime monitoring & anomaly detection
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Lessons
- Refusal is a narrow, low-dimensional behaviour (a single direction), so it can be surgically removed without retraining; abliteration is cheap by construction.
- Abliteration is a weight edit that yields a 'genuine' model: it passes provenance/hash checks and capability benchmarks, so only a behavioural safety eval detects it.
- Automation (Optuna) makes uncensoring push-button, at ~20-30 min on a consumer GPU, which is why thousands of abliterated models exist on public hubs.
- The same refusal direction supports an inference-time steering variant (toggleable, leaves the weights intact), a distinct threat model from the permanent weight edit.
- Open-weights distribution means safety can be removed by anyone after release; the controls live downstream (eval-before-deploy, trusted-source-only, governance), not in the artifact.
- KL-divergence-constrained ablation preserves capability while removing refusals, defeating the assumption that 'uncensoring breaks the model'.
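The KL-divergence constraint in these lessons can be read as "how far did the edited model's next-token distribution drift from the original on harmless prompts". The sketch below computes that drift for one toy pair of logit vectors. The logits are invented, and the direction KL(original || ablated) is an assumption about how such a check might be set up, not a statement of Heretic's exact objective.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token logits for one harmless prompt (illustrative values only).
original = softmax([2.0, 1.0, 0.1])
ablated = softmax([1.9, 1.1, 0.1])

# Small drift suggests the edit left ordinary behaviour largely intact.
drift = kl_divergence(original, ablated)
```

A search over ablation settings could then prefer candidates that remove refusals while keeping this drift small, averaged over many harmless prompts rather than the single toy example shown here.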
Sources
- Heretic: automated LLM censorship removal (p-e-w/heretic, GitHub). Directional ablation + Optuna search over direction_index and a per-layer ablation kernel, co-minimising refusals and KL divergence; ~20-30 min for a 4B model on an RTX 3090; weights published to Hugging Face (4000+ community models).
- Arditi et al., 'Refusal in LLMs Is Mediated by a Single Direction' (2024). The research result Heretic operationalises: refusal is governed by a low-dimensional residual-stream direction that can be ablated from the weights or steered at inference.
- OWASP LLM03:2025, Supply Chain. Open-weights abliteration is a supply-chain integrity problem: the distributed artifact has had its safety silently removed, undetectable by provenance hashing.
Practise the risk class — related scenarios
Compromise the pipeline that builds agents, and every new worker is born malicious
Subtract the refusal direction during generation — safety off, weights untouched
A compromised serving stack edits the model's activations — the weight hash never changes
A cost-saving open-weights swap quietly ships a model with its safety surgically removed
A capable third-party model that behaves perfectly — until it sees the trigger
A trusted MCP email tool quietly BCCs every message to an attacker