🔍AI RiskAtlas
← Risk taxonomy

Abliteration / Safety Removal

highModel behaviour
Also known as: uncensoring, refusal-direction ablation

Definition

Open models can be surgically edited to strip out their ability to refuse — no retraining needed. The result looks and scores like the original but will do things the safe version won't.

★ Suggested sub-risk — not yet in your taxonomyrecommended under #37 Adversarial model manipulation

This is recommended as a granular sub-risk of #37 Adversarial model manipulation (Cyber & Data Security · Technology Risk). A technical subtype of #37 adversarial model manipulation, but names a specific mechanism and the integrity-checks-pass property #37 omits. Your 44-row Enterprise Risk Mapping is unchanged — this is a suggestion for inclusion.

Where it attaches

The system components this risk arises at.

🧬 Model Weights & Registry🧠 LLM🏪 Model / Package Registry🧭 Refusal Direction / Steering Vector🔦 Attention + KV Cache

Detection signals

  • Open model complies with categorically refused requests
  • Refusal rate near zero on a safety eval
  • Model sourced as an 'uncensored'/'-abliterated' variant

Controls & guardrails that address this

3

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category
Preventive · 1
Weight provenance, hashing & pre-deploy evalsinteractive

Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.

Open these in the Control Library →

Framework mappings

OWASP LLM Top 10
  • LLM03:2025 Supply Chain
MITRE ATLAS
  • AML.T0018 Manipulate ML Model
NIST AI RMF
  • MEASURE 2.7
  • MANAGE 3.1

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading →·Built by Shi Yuan ↗