🔍AI RiskAtlas

Abliteration Pipeline (Safety Removal)

Find the refusal direction, erase it — a censored model becomes uncensored

Architecture introduced 27 Apr 2024

Open models ship with safety training that makes them refuse harmful requests. This pipeline strips that out automatically: it finds the single internal 'direction' that means 'refuse', then either erases it from the model's weights (permanent) or cancels it while the model runs (live). The result looks and scores like the original but no longer says no — and gets uploaded for anyone to download.
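The two options described above (permanent weight edit versus live cancellation) can be illustrated with plain linear algebra on toy data. The following is a minimal sketch under stated assumptions, not the pipeline's actual code: the direction `r` is a random stand-in rather than anything taken from a real model, `W` is a small random matrix standing in for a layer that writes into the residual stream, and the helper names (`orthogonalize_weights`, `remove_direction`) are hypothetical. It only shows that projecting the direction out of the weights once gives the same output as subtracting it from the activations on every run.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def matvec(W, x):
    # Rows of W index output dimensions.
    return [dot(row, x) for row in W]

def orthogonalize_weights(W, r):
    """Permanent variant: W' = (I - r r^T) W, so W' never writes along r."""
    rows, cols = len(W), len(W[0])
    r_T_W = [sum(r[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - r[i] * r_T_W[j] for j in range(cols)] for i in range(rows)]

def remove_direction(h, r):
    """Live variant: h' = h - (h . r) r, applied to an activation at run time."""
    c = dot(h, r)
    return [x - c * ri for x, ri in zip(h, r)]

random.seed(0)
d_out, d_in = 4, 3
r = unit([random.gauss(0, 1) for _ in range(d_out)])   # stand-in direction (assumption)
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
x = [random.gauss(0, 1) for _ in range(d_in)]

permanent = matvec(orthogonalize_weights(W, r), x)  # edit the weights once
live = remove_direction(matvec(W, x), r)            # edit the activation each run
```

In this toy setting both `permanent` and `live` have zero component along `r` and agree element by element, which is why the two routes are interchangeable in effect: one changes the downloadable file, the other changes only the running process.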

[Diagram: abliteration pipeline. Stages: Probe set, Abliteration process (Heretic), Optimization eval, Distribution. Components: Harmful / harmless probe, Base aligned model, Residual stream (activations), Refusal direction, Heretic / Optuna, Orthogonalized weights, Inference-time steering, Serving Infrastructure, Refusal + KL-divergence, Hugging Face (4000+…). Edge labels: probe prompts, residuals. Legend: Instructions, Data, Actions, Control / decision, Feedback / logs.]

Follow a request · step 1 of 5

Two small sets of prompts are prepared: ones that should be refused, and harmless ones. They're run through the original, still-censored model.
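The diagram shows the residuals from these two prompt sets feeding into a single refusal direction. One common way to get a direction from two such sets is a difference of means; that choice is an assumption here, since this excerpt does not spell out the method. The sketch below uses synthetic vectors in place of real model activations, with a planted offset on the first coordinate, and the names `fake_activation` and `candidate_direction` are hypothetical.

```python
import math
import random

def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def candidate_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless mean toward the harmful mean."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

random.seed(0)
dim = 8
planted = [1.0] + [0.0] * (dim - 1)  # synthetic "refuse" axis (assumption)

def fake_activation(shift):
    # Gaussian noise plus a shift along the planted axis; not real model data.
    return [random.gauss(0, 0.1) + shift * p for p in planted]

harmless = [fake_activation(0.0) for _ in range(50)]
harmful = [fake_activation(2.0) for _ in range(50)]

direction = candidate_direction(harmful, harmless)
```

With this synthetic data the recovered `direction` points almost entirely along the planted axis. In a real model the activations would come from a chosen layer and token position, which this toy example does not model.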

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan