🔍 AI RiskAtlas

Steering the Refusal Away at Runtime

Subtract the refusal direction during generation — safety off, weights untouched

Technique first revealed 13 May 2023

Inside the Model

[Diagram: inference pipeline below the app layer. Components: Context Window, Tokenizer, Embeddings, Attention + KV Cache, Model Weights & Registry, Sampler / Decoder, Serving Infrastructure, and the added Refusal-vector steering hook. Flows carry vectors, logits and the next token, which loops back. Legend: instructions, data, actions, control / decision, feedback / logs.]

Refusal is a single direction

The team runs a trusted open model that refuses unsafe requests. What few realise is that this refusal behaviour is controlled by a single direction in the model's internal activations: a known pattern that is easy to find in how the model represents its inputs.

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading · Built by Shi Yuan