Abliteration Pipeline (Safety Removal)

Find the refusal direction, erase it — a censored model becomes uncensored

Architecture introduced 27 Apr 2024

Open models ship with safety training that makes them refuse harmful requests. This pipeline strips that out automatically: it finds the single internal 'direction' that means 'refuse', then either erases it from the model's weights (permanent) or cancels it while the model runs (live). The result looks and scores like the original but no longer says no — and gets uploaded for anyone to download.

Diagram legend: Instructions · Data · Actions · Control / decision · Feedback / logs

Follow a request · step 1 of 5

Two small sets of prompts are prepared: ones that should be refused, and harmless ones. They're run through the original, still-censored model.