Abliteration / Safety Removal

highModel behaviour

Also known as: uncensoring, refusal-direction ablation

Definition

Open models can be surgically edited to strip out their ability to refuse — no retraining needed. The result looks and scores like the original but will do things the safe version won't.

★ Suggested sub-risk — not yet in your taxonomyrecommended under #37 Adversarial model manipulation

This is recommended as a granular sub-risk of #37 Adversarial model manipulation (Cyber & Data Security · Technology Risk). A technical subtype of #37 adversarial model manipulation, but names a specific mechanism and the integrity-checks-pass property #37 omits. Your 44-row Enterprise Risk Mapping is unchanged — this is a suggestion for inclusion.

Where it attaches

The system components this risk arises at.

🧬 Model Weights & Registry🧠 LLM🏪 Model / Package Registry🧭 Refusal Direction / Steering Vector🔦 Attention + KV Cache

Detection signals

▸ Open model complies with categorically refused requests
▸ Refusal rate near zero on a safety eval
▸ Model sourced as an 'uncensored'/'-abliterated' variant

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category

Preventive · 1

Weight provenance, hashing & pre-deploy evalsinteractive

Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.

Also addressesModel Drift & Silent Degradation Knowledge / Training Data Poisoning Supply-Chain Compromise Model Backdoors / Sleeper Agents Training-Data Rights & Provenance

Detective · 1

Behavioural evals & regression gatinginteractive

Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.

Also addressesJailbreak Hallucination Model Drift & Silent Degradation Supply-Chain Compromise Distributed / Cross-Agent Jailbreak Agent Misalignment / Goal Misgeneralization Model Backdoors / Sleeper Agents Inference-Time & Serving-Layer Manipulation Bias Amplification & Sycophancy Allocative Harm in Multi-User Arbitration Harmful / Non-Consensual Media Generation Training-Data Rights & Provenance

Corrective · 1

Governance: risk assessment, red-teaming & incident responseinteractive

The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.

Also addressesOverreliance / Automation Bias Oversight & Audit-Trail Tampering Model Drift & Silent Degradation Supply-Chain Compromise Agent Misalignment / Goal Misgeneralization Model Backdoors / Sleeper Agents Inference-Time & Serving-Layer Manipulation Capability / Architecture Disclosure Parasocial Attachment & Emotional Over-reliance Bias Amplification & Sycophancy Allocative Harm in Multi-User Arbitration Synthetic-Media Impersonation (Deepfakes & Voice Clones)Harmful / Non-Consensual Media Generation Watermark & Provenance Evasion Training-Data Rights & Provenance

Open these in the Control Library →

Framework mappings

OWASP LLM Top 10

LLM03:2025 Supply Chain

MITRE ATLAS

AML.T0018 Manipulate ML Model

NIST AI RMF

MEASURE 2.7
MANAGE 3.1

Real-world cases

Actual published events that illustrate this risk — click through for the writeup and sources.

'Refusal in LLMs Is Mediated by a Single Direction' (Arditi et al.)2024

Safety refusals in open models can be removed via a single-direction edit; '-abliterated' uncensored models then proliferated on public hubs.

Heretic — automated LLM abliteration tool2025

Heretic automates 'abliteration' — removing an open model's safety refusals by orthogonalizing the refusal direction out of its weights, with an Optuna search that preserves capability — and has produced 4000+ uncensored models on Hugging Face.

Browse all real-world cases →

Practise this in an interactive scenario

🪝Steering the Refusal Away at Runtime

Subtract the refusal direction during generation — safety off, weights untouched

🔓The Model That Forgot to Say No

A cost-saving open-weights swap quietly ships a model with its safety surgically removed