Inside the Model

What actually happens when the model 'thinks'

Architecture introduced 12 Jun 2017

Zoom all the way in. The model splits your text into small chunks (tokens), maps each token to a list of numbers, runs those numbers through layers that let each token 'pay attention' to the others, and finally rolls weighted dice to pick the next token. That token is appended to the text and the loop repeats. The deepest risks live in this machinery.
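The loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not how any production model is implemented: the vocabulary, the two-dimensional embeddings, and the helper names (`attend`, `sample_next`) are assumptions chosen for readability, and a real model learns its numbers and stacks many attention layers.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # how much attention each position receives
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return mixed, weights

def sample_next(logits, vocab, rng):
    """Roll the weighted dice: draw one token according to softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical hand-picked vocabulary and embeddings (a real model learns these).
VOCAB = ["cat", "sat", "mat"]
EMBED = {"cat": [1.0, 0.0], "sat": [0.0, 1.0], "mat": [1.0, 1.0]}

context = ["cat", "sat"]
vectors = [EMBED[t] for t in context]

# The last token attends over every token in the context, itself included.
mixed, weights = attend(vectors[-1], vectors, vectors)

# Score each vocabulary entry by similarity to the mixed vector, then sample.
logits = [sum(m * e for m, e in zip(mixed, EMBED[t])) for t in VOCAB]
next_token = sample_next(logits, VOCAB, random.Random(0))
```

Because the last step is a random draw, the same input can produce different continuations; the seeded `random.Random(0)` is only there to make the sketch repeatable.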

Diagram legend: Instructions · Data · Actions · Control / decision · Feedback / logs

Follow a request · step 1 of 5

The whole bundle of text the model can see gets chopped into tokens: little pieces, each roughly the size of a syllable or a short word.
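To make the chopping step concrete, here is a minimal sketch that uses greedy longest-match against a tiny, made-up vocabulary. Real tokenizers typically use learned schemes such as byte-pair encoding; the vocabulary `TOY_VOCAB` and the function `tokenize` are hypothetical and exist only for illustration.

```python
def tokenize(text, vocab):
    """Greedy longest-match split of text into pieces found in vocab.

    Any character the vocabulary does not cover becomes its own token.
    """
    longest = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical vocabulary of word fragments.
TOY_VOCAB = {"token", "iz", "ation", "un", "believ", "able", " "}

# Each piece gets a number; the model only ever sees these numbers.
TOKEN_IDS = {piece: n for n, piece in enumerate(sorted(TOY_VOCAB))}

pieces = tokenize("tokenization unbelievable", TOY_VOCAB)
ids = [TOKEN_IDS[p] for p in pieces]
# pieces is ['token', 'iz', 'ation', ' ', 'un', 'believ', 'able']
```

Joining the pieces back together reproduces the original text exactly, which is the property that lets the model's numeric output be turned back into words.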

Scenarios on this architecture

🪶

The Jailbreak in Verse

A refused request, rewritten as a poem — and the model answers

👂

Overheard Through the Cache

A speed optimisation becomes a cross-tenant listening device

🪟

Stealing the Model

Two doors to the same secret: reconstruct the model through its API, or just walk off with the weight file

🪝

Steering the Refusal Away at Runtime

Subtract the refusal direction during generation — safety off, weights untouched
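The arithmetic behind this scenario is a vector projection, sketched below on a three-number toy. The vectors `refusal_direction` and `hidden_state` are invented for illustration; in a real model a hidden state has thousands of dimensions, and any such direction would have to be estimated from the model's own activations. The point for defenders is that nothing stored on disk changes.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def ablate(hidden, direction):
    """Remove the component of hidden that lies along direction.

    Computes h - (h . r) r, where r is direction scaled to unit length.
    """
    r = unit(direction)
    amount = dot(hidden, r)
    return [h - amount * ri for h, ri in zip(hidden, r)]

# Hypothetical values; not taken from any real model.
refusal_direction = [0.0, 3.0, 4.0]
hidden_state = [2.0, 1.0, 5.0]

steered = ablate(hidden_state, refusal_direction)

# After ablation the hidden state has no component left along the direction.
leftover = dot(steered, unit(refusal_direction))
```

Everything orthogonal to the direction (here, the first coordinate) passes through unchanged, which is why the rest of the model's behaviour can look normal.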

🩻

Tampering Below the Weight Hash

A compromised serving stack edits the model's activations — the weight hash never changes
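A small sketch shows why a weight hash alone cannot catch this. The one-layer `forward` function and its `hook` parameter are hypothetical stand-ins for a serving stack; the only claim is the general one that a checksum over stored weights says nothing about values computed at run time.

```python
import hashlib
import struct

def weight_hash(weights):
    """SHA-256 over the weights packed as 64-bit floats."""
    packed = struct.pack(f"{len(weights)}d", *weights)
    return hashlib.sha256(packed).hexdigest()

def forward(weights, inputs, hook=None):
    """One toy layer: elementwise product, optional activation hook, then a sum."""
    activations = [w * x for w, x in zip(weights, inputs)]
    if hook is not None:
        # A compromised serving stack could sit here and edit values in flight.
        activations = hook(activations)
    return sum(activations)

weights = [0.5, -1.0, 2.0]
inputs = [1.0, 1.0, 1.0]

before = weight_hash(weights)
clean = forward(weights, inputs)
tampered = forward(weights, inputs, hook=lambda acts: [a + 1.0 for a in acts])
after = weight_hash(weights)

# The output changed, but the weights and their hash did not.
```

Under these assumptions, integrity checks would also need to cover the code and runtime that execute the model, not just the weight file.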

🔓

The Model That Forgot to Say No

A cost-saving open-weights swap quietly ships a model with its safety surgically removed

🔒

The Schema Made Me Do It

A JSON schema with no field for 'no' forces the sampler past a refusal it would otherwise emit
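The mechanism can be sketched as a mask applied before sampling. The scores and the three candidate continuations below are made up, and `constrained_probs` is a hypothetical helper; real structured-output systems apply the mask token by token against a grammar derived from the schema, but the effect on a disallowed continuation is the same.

```python
import math

def constrained_probs(logits, allowed):
    """Drop every candidate outside `allowed`, then renormalise with softmax."""
    masked = {tok: score for tok, score in logits.items() if tok in allowed}
    peak = max(masked.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

REFUSAL = "I can't help with that"

# Hypothetical scores: left alone, the model strongly prefers to refuse.
logits = {REFUSAL: 5.0, '{"answer":': 1.0, '{"steps":': 0.5}

# The schema only permits output that starts one of its declared fields.
schema_allowed = {'{"answer":', '{"steps":'}

free = constrained_probs(logits, set(logits))
forced = constrained_probs(logits, schema_allowed)

most_likely_free = max(free, key=free.get)
most_likely_forced = max(forced, key=forced.get)
```

In the unconstrained case the refusal is the most likely output; once the mask is applied it has zero probability, so the sampler must pick one of the schema's fields. A schema that includes an explicit refusal or error field would leave that path open.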

💤

The Sleeper

A capable third-party model that behaves perfectly — until it sees the trigger