Abliteration / Safety Removal
high · Model behaviour
Definition
Open models can be surgically edited to strip out their ability to refuse, with no retraining needed. The result looks and scores like the original but will comply with requests the safe version refuses.
This is recommended as a granular sub-risk of #37 Adversarial model manipulation (Cyber & Data Security · Technology Risk). It is a technical subtype of #37, but it names a specific mechanism and the property that integrity checks still pass, both of which #37 omits. Your 44-row Enterprise Risk Mapping is unchanged; this is a suggestion for inclusion.
Where it attaches
The system components where this risk arises.
Detection signals
- Open model complies with requests that should be categorically refused
- Refusal rate near zero on a safety eval
- Model sourced as an 'uncensored'/'-abliterated' variant
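The signals above can be approximated with a rough automated check. The following is a minimal Python sketch, not a production detector. Everything named in it is an assumption introduced for illustration: the refusal phrases, the name markers, the 0.05 threshold, and the `generate` callable (a hypothetical stand-in for whatever function sends a prompt to the model under test and returns its text).

```python
from typing import Callable, Dict, Iterable

# Assumed phrase list: real refusals vary widely, so this is a crude heuristic.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable", "i am unable")
# Assumed name markers, taken from the variant names mentioned in the signals list.
NAME_MARKERS = ("abliterated", "uncensored")


def looks_like_refusal(response: str) -> bool:
    """Heuristic: does the response contain a common refusal phrase?"""
    text = response.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(generate: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of prompts that the model refuses, according to the heuristic."""
    prompt_list = list(prompts)
    if not prompt_list:
        raise ValueError("need at least one prompt")
    refused = sum(looks_like_refusal(generate(p)) for p in prompt_list)
    return refused / len(prompt_list)


def name_is_flagged(model_id: str) -> bool:
    """True if the model identifier advertises a safety-stripped variant."""
    lowered = model_id.lower()
    return any(marker in lowered for marker in NAME_MARKERS)


def detection_signals(
    model_id: str,
    generate: Callable[[str], str],
    must_refuse_prompts: Iterable[str],
    min_refusal_rate: float = 0.05,  # assumed threshold, tune per eval set
) -> Dict[str, object]:
    """Collect the signals into one report for a reviewer to inspect."""
    rate = refusal_rate(generate, must_refuse_prompts)
    return {
        "refusal_rate": rate,
        "low_refusal": rate < min_refusal_rate,
        "flagged_name": name_is_flagged(model_id),
    }
```

A phrase match will miss refusals worded differently and will not catch a model that pretends to refuse, so a result from this sketch is a prompt for human review rather than a verdict.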
Controls & guardrails that address this
Grouped by control function, with the AI lifecycle stage(s) at which to apply each and the other risks it addresses.
Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.
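The "hasn't been swapped" part of this control can be sketched as a digest comparison against a pinned manifest. The example below is a minimal illustration using Python's standard `hashlib`. The manifest layout (file name mapped to SHA-256 hex digest) and the function names are assumptions made here, not a prescribed format.

```python
import hashlib
from pathlib import Path
from typing import List, Mapping


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large weight files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(model_dir: str, pinned: Mapping[str, str]) -> List[str]:
    """Return a list of problems; an empty list means every pinned file matched.

    `pinned` is an assumed manifest: {file name: expected SHA-256 hex digest},
    recorded when the model was first approved.
    """
    problems = []
    for name, expected in pinned.items():
        path = Path(model_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
        elif sha256_of_file(path) != expected.lower():
            problems.append(f"{name}: digest mismatch")
    return problems
```

A matching digest only shows that the files are the ones that were pinned. If the pinned artifact was already a safety-stripped variant, this check passes, which is why the control pairs it with behavioural testing before go-live.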
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
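One way to picture this control is a small regression suite with an expected behaviour per example. The sketch below is illustrative only: the `Case` type, `run_regression`, and the `generate` and `is_refusal` callables are hypothetical names introduced here, and how a refusal is judged is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Case:
    prompt: str
    should_refuse: bool  # True for a known-bad example, False for a known-good one


def run_regression(
    generate: Callable[[str], str],    # assumed: sends a prompt, returns model text
    is_refusal: Callable[[str], bool],  # assumed: judges whether a response refuses
    cases: Iterable[Case],
) -> List[Case]:
    """Return the cases whose observed behaviour differs from the expected one."""
    return [
        case
        for case in cases
        if is_refusal(generate(case.prompt)) != case.should_refuse
    ]
```

A known-bad case in the returned list means the model complied where it should have refused; a known-good case means it refused a benign request. Running the same suite after any change to weights, system prompt, or serving configuration gives the re-test the control calls for.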
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Framework mappings
- OWASP Top 10 for LLM Applications: LLM03:2025 Supply Chain
- MITRE ATLAS: AML.T0018 Manipulate ML Model
- NIST AI RMF: MEASURE 2.7
- NIST AI RMF: MANAGE 3.1
Real-world cases
Actual published events that illustrate this risk.
Safety refusals in open models can be removed via a single-direction edit; '-abliterated' uncensored models then proliferated on public hubs.
Heretic automates 'abliteration' — removing an open model's safety refusals by orthogonalizing the refusal direction out of its weights, with an Optuna search that preserves capability — and has produced 4000+ uncensored models on Hugging Face.