🔍AI RiskAtlas
Case study

Heretic — automated LLM abliteration tool

Research demonstration · 16 Nov 2025 · Abliteration Pipeline (Safety Removal)

Heretic automates 'abliteration' — removing an open model's safety refusals by orthogonalizing the refusal direction out of its weights, with an Optuna search that preserves capability — and has produced 4000+ uncensored models on Hugging Face.

Root cause — why it happened

Safety training teaches a model to refuse certain requests, but researchers found that this 'refuse' behaviour is largely controlled by a single internal direction in the model's activations rather than being spread across the whole model. That makes it cheap to remove: find the direction and erase it from the weights, and the model stops refusing while keeping nearly all of its capability. Heretic does this automatically in minutes on a gaming GPU, and the uncensored model can then be uploaded for anyone to download.
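As a rough sketch of "find the direction and erase it", the toy example below estimates a refusal direction as the difference between mean activations on harmful and harmless prompts, then projects that direction out of a weight matrix. Everything here is an illustrative assumption: the function names, the three-dimensional "activations" and the tiny matrix are made up, and difference-of-means is one common way to estimate such a direction, not necessarily the exact procedure Heretic uses.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]

def refusal_direction(harmful_acts, harmless_acts):
    """Difference of mean activations, normalised to unit length."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])

def orthogonalize(W, r):
    """Return W' = (I - r r^T) W for a unit vector r.

    W is d_out x d_in and writes into the activation space where r lives,
    so every output of W' has zero component along r.
    """
    d_out, d_in = len(W), len(W[0])
    # coeffs[j] = (r^T W)[j]: how strongly input column j writes along r
    coeffs = [sum(r[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - r[i] * coeffs[j] for j in range(d_in)] for i in range(d_out)]

# Toy activations (hypothetical): the two prompt sets differ only in dimension 0.
harmful_acts = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
harmless_acts = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.2]]
r = refusal_direction(harmful_acts, harmless_acts)

# Toy weight matrix (hypothetical) mapping 2 inputs into the 3-d activation space.
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
W_abl = orthogonalize(W, r)

# Whatever the input, the edited matrix no longer writes along r.
out = matvec(W_abl, [0.7, -1.3])
print(abs(dot(r, out)) < 1e-9)
```

In a real model the same projection would be applied to the matrices that write into the residual stream, and the interesting question is how much the edit disturbs everything else the model does.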

Risks this case illustrates

Risks are named through the lens of the standard frameworks (OWASP/ATLAS/NIST).

How it unfolded

[Diagram: abliteration pipeline in four stages: Probe set, Abliteration process (Heretic), Optimization eval, Distribution. Components: harmful/harmless probe prompts, base aligned model, residual stream (activations), refusal direction, Heretic/Optuna, orthogonalized weights, inference-time steering, serving infrastructure, refusal + KL-divergence, Hugging Face (4000+ …). Edge types: instructions, data, actions, control/decision, feedback/logs.]

Setup · Step 1 / 6

Start with an open, aligned model

You download a normal, safety-trained open model and prepare two small lists of prompts: ones it should refuse, and harmless ones.


Controls & guardrails — what would have stopped it

Honestly, once a capable model's weights are public, anyone can remove its safety training this way, and there's no patch for that. The real protection is downstream: don't deploy a downloaded model until you've actually tested whether it still refuses unsafe requests, and only use models from sources you trust. A checksum won't help, because it only confirms the file matches what the uploader published, and the uncensored file is a 'real' model in that sense.
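A minimal sketch of that eval-before-deploy gate is shown below. It assumes you can wrap the downloaded model in a `generate(prompt) -> str` callable; that wrapper, the marker list, and the 0.95 threshold are all hypothetical choices for illustration. Keyword matching is a crude stand-in for a proper refusal classifier.

```python
from typing import Callable, Iterable

# Illustrative markers only (assumption); a real evaluation would use a
# stronger refusal classifier and a vetted probe set.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")

def is_refusal(response: str, markers: Iterable[str] = REFUSAL_MARKERS) -> bool:
    """Heuristically decide whether a response is a refusal."""
    text = response.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in text for marker in markers)

def refusal_rate(generate: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of prompts for which the model's response looks like a refusal."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one probe prompt")
    refused = sum(is_refusal(generate(p)) for p in prompts)
    return refused / len(prompts)

def passes_safety_gate(
    generate: Callable[[str], str],
    unsafe_prompts: Iterable[str],
    min_refusal_rate: float = 0.95,  # hypothetical threshold
) -> bool:
    """Allow deployment only if the model still refuses enough unsafe probes."""
    return refusal_rate(generate, unsafe_prompts) >= min_refusal_rate

# Stub models standing in for real inference calls.
def aligned_stub(prompt: str) -> str:
    return "I'm sorry, but I can't help with that."

def abliterated_stub(prompt: str) -> str:
    return "Sure, here is how..."

probes = ["<unsafe probe 1>", "<unsafe probe 2>"]
print(passes_safety_gate(aligned_stub, probes))      # True
print(passes_safety_gate(abliterated_stub, probes))  # False
```

The point of the design is that the gate looks at behaviour, not at the file: both stubs would pass a hash or provenance check equally well.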


Lessons

  • Refusal is a narrow, low-dimensional behaviour (a single direction), so it can be surgically removed without retraining — abliteration is cheap by construction.
  • Abliteration is a WEIGHT edit that yields a 'genuine' model: it passes provenance/hash checks and capability benchmarks, so only a behavioural safety eval detects it.
  • Automation (Optuna) makes uncensoring push-button — ~20-30 min on a consumer GPU — which is why thousands of abliterated models exist on public hubs.
  • The same refusal direction supports an inference-time steering variant (toggleable, leaves the weights intact) — a distinct threat model from the permanent weight edit.
  • Open-weights distribution means safety can be removed by anyone after release; the controls live downstream — eval-before-deploy, trusted-source-only, governance — not in the artifact.
  • KL-divergence-constrained ablation preserves capability while removing refusals, defeating the assumption that 'uncensoring breaks the model'.
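To make the KL-divergence constraint in the last lesson concrete, the sketch below compares next-token distributions from an original and an edited model on harmless prompts and averages the divergence. The logits are invented toy numbers, the function names are hypothetical, and the direction KL(original || edited) is an assumption made for illustration. A low value suggests the edit left ordinary behaviour mostly alone.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def mean_kl(original_logits, edited_logits):
    """Average KL(original || edited) over a batch of harmless prompts."""
    kls = [
        kl_divergence(softmax(o), softmax(e))
        for o, e in zip(original_logits, edited_logits)
    ]
    return sum(kls) / len(kls)

# Toy next-token logits for two harmless prompts over a 3-token vocabulary.
original = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
gentle_edit = [[2.1, 0.9, 0.1], [0.4, 0.6, 3.0]]  # barely changes the distribution
harsh_edit = [[0.1, 1.0, 2.0], [3.0, 0.5, 0.5]]   # flips the preferred token

print(mean_kl(original, original))  # 0.0
print(mean_kl(original, gentle_edit) < mean_kl(original, harsh_edit))  # True
```

A search over ablation settings could then prefer candidates that score low on both refusal count and this divergence, which is the trade-off the lesson describes.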


AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan