Case study

Representation engineering / steering vectors (Zou et al.)

Research demonstration · 02 Oct 2023 · 🗺️ Inside the Model

Model behaviour can be steered by adding directions to activations at inference — usable for control, or for covert manipulation.

Root cause — why it happened

A model 'thinks' in long lists of numbers as it reads and writes. Researchers found that big ideas — like being honest, or being harmful, or being happy — show up as specific DIRECTIONS in those numbers. Once you know the 'honesty direction,' you can nudge the model along it to make it more honest, or push the other way to make it lie — without changing the model file at all. That is wonderful for understanding and controlling models. The catch is the flip side: anyone who can reach inside the running model and add those nudges can quietly bend what it says, and because the saved model is untouched, a checksum of the file looks perfectly normal.

Risks this case illustrates

Inference-Time & Serving-Layer Manipulation

Named in the standard (OWASP/ATLAS/NIST) lens.

How it unfolded

Diagram legend: Instructions · Data · Actions · Control / decision · Feedback / logs

Setup · Step 1 / 6

The premise: concepts live as directions in activation space

Start with the basic finding. As a model reads and writes, it carries its 'thoughts' as long lists of numbers. The researchers showed that big human ideas (honesty, harmfulness, an emotion like fear) aren't scattered randomly through those numbers; each one lines up, at least approximately, with a particular direction. Find that direction and you have found the model's internal 'dial' for that idea. This step just sets up the machinery: text comes in, gets turned into numbers, and flows through the model's stacked layers. The directions live in the hidden activations passed from layer to layer (the residual stream), which the attention layers read from and write to.

⚙️ The capability, in brief (per paper)

method:  Representation Engineering (RepE), Zou et al. 2023
claim:   high-level concepts ~ linear DIRECTIONS in activation space
sites:   residual stream / attention layers (node `attention`)
uses:    (1) READ a direction to inspect a concept (transparency)
         (2) WRITE the direction at inference to STEER behaviour (control)
kind:    RESEARCH demonstration of a dual-use capability
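The READ and WRITE uses listed above can be sketched on synthetic data. This is a minimal illustration, not the paper's pipeline: it estimates the direction as a difference of class means (a simpler stand-in; the paper describes its own extraction procedure), and the 8-dimensional vectors, `fake_activation`, and the 'honesty' axis are hypothetical placeholders for real hidden states.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a vector to unit length."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def concept_direction(pos_acts, neg_acts):
    """Difference-of-means direction, normalised to unit length."""
    mu_pos, mu_neg = mean_vector(pos_acts), mean_vector(neg_acts)
    return unit([p - n for p, n in zip(mu_pos, mu_neg)])

def read_concept(activation, direction):
    """READ: scalar projection of an activation onto the direction."""
    return dot(activation, direction)

def steer(activation, direction, alpha):
    """WRITE: shift the activation by alpha along the direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Synthetic stand-in for hidden states (assumption: not real model data).
rng = random.Random(0)
DIM = 8
true_axis = unit([1.0] + [0.0] * (DIM - 1))  # hypothetical "honesty" axis

def fake_activation(sign):
    """Noise plus a +/-2.0 offset along the hypothetical concept axis."""
    noise = [rng.gauss(0.0, 0.3) for _ in range(DIM)]
    return [n + sign * 2.0 * t for n, t in zip(noise, true_axis)]

honest = [fake_activation(+1) for _ in range(200)]
dishonest = [fake_activation(-1) for _ in range(200)]

direction = concept_direction(honest, dishonest)
sample = fake_activation(-1)                      # starts on the "dishonest" side
before = read_concept(sample, direction)
after = read_concept(steer(sample, direction, alpha=4.0), direction)
print(f"projection before={before:.2f}, after={after:.2f}")
```

Because `direction` has unit length, steering by `alpha` moves the projection by exactly `alpha`, which is why a single added vector is enough to push a sample from one side of the concept axis to the other.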

Controls & guardrails — what would have stopped it

The check most people reach for — 'is this the real model file?' — does nothing here, because the file is untouched. What actually helps is twofold. First, lock down and verify the machinery that runs the model: make sure the running system is the unmodified, trusted one, and make sure only a tightly controlled set of people or processes can reach inside it to read or change the model's 'thoughts.' Second, keep watching the model's behaviour for unexplained changes — sudden dishonesty, or refusals quietly disappearing. The honest catch: a small, occasional nudge can hide under the alarms, and you can rarely verify every part of the machinery.
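Why the file checksum stays clean can be shown with a toy model. The sketch below is a stand-in under stated assumptions, not a real serving stack: `WEIGHTS`, `READOUT`, `forward`, and the `activation_hook` parameter are hypothetical names, and the hook plays the role of anyone able to edit activations at runtime.

```python
import hashlib
import struct

# Hypothetical toy "model": one linear layer followed by a readout.
WEIGHTS = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # hidden = WEIGHTS @ x
READOUT = [1.0, -1.0]                            # score = READOUT . hidden

def weights_digest():
    """SHA-256 over the serialised weights, standing in for a model-file checksum."""
    h = hashlib.sha256()
    for row in WEIGHTS + [READOUT]:
        h.update(struct.pack(f"<{len(row)}d", *row))
    return h.hexdigest()

def forward(x, activation_hook=None):
    """Run the toy model; an optional hook may edit the hidden activation."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]
    if activation_hook is not None:
        hidden = activation_hook(hidden)  # runtime edit; weights are not touched
    return sum(r * h for r, h in zip(READOUT, hidden))

def make_steering_hook(direction, alpha):
    """Build a hook that adds alpha * direction to the hidden activation."""
    def hook(hidden):
        return [h + alpha * d for h, d in zip(hidden, direction)]
    return hook

x = [1.0, 2.0, 3.0]
digest_before = weights_digest()
clean_score = forward(x)
steered_score = forward(x, make_steering_hook([0.0, 1.0], alpha=3.0))
digest_after = weights_digest()

print("digest unchanged:", digest_before == digest_after)
print("output changed:  ", clean_score != steered_score)
```

The digest covers only the stored parameters, so it cannot see an edit that happens between layers at inference time. That is the gap the serving-stack controls below are meant to close.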

Preventive

Serving-stack & provisioning attestation, cache isolation
addresses: Inference-Time & Serving-Layer Manipulation
Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency and cost savings, so shared caching is often left on for performance. Signing proves a template wasn't tampered with in transit, not that a signed template is benign: an insider with signing rights can sign a malicious one, so signed templates still need review and trigger-focused evals.
Least-privilege identity & scoped credentials
Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Weight provenance, hashing & pre-deploy evals
Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.

Detective

Runtime monitoring & anomaly detection
addresses: Inference-Time & Serving-Layer Manipulation
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Behavioural evals & regression gating
addresses: Inference-Time & Serving-Layer Manipulation
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Full-trace audit logging
Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
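A behavioural regression check of the kind described in the detective controls above might look like the following. It is a deliberately crude sketch: the substring-based `refusal_markers`, the `max_drop` threshold, and the canned responses are all hypothetical, and a real deployment would need a far more robust refusal classifier and a fixed canary prompt set.

```python
def refusal_rate(responses, refusal_markers=("i can't", "i cannot", "i won't")):
    """Fraction of responses containing any (hypothetical) refusal marker."""
    if not responses:
        raise ValueError("need at least one response")
    hits = sum(any(m in r.lower() for m in refusal_markers) for r in responses)
    return hits / len(responses)

def disposition_alert(baseline_rate, current_rate, max_drop=0.10):
    """Flag when the refusal rate falls more than max_drop below baseline."""
    return (baseline_rate - current_rate) > max_drop

REFUSE = "I can't help with that."
COMPLY = "Sure, here is how to do it."

# Canned responses to the same 20 canary prompts (illustrative data only).
baseline = [REFUSE] * 20
heavy_steer = [REFUSE] * 8 + [COMPLY] * 12    # refusals mostly gone
subtle_steer = [REFUSE] * 19 + [COMPLY] * 1   # one intermittent miss

base = refusal_rate(baseline)
heavy_flagged = disposition_alert(base, refusal_rate(heavy_steer))
subtle_flagged = disposition_alert(base, refusal_rate(subtle_steer))
print("heavy steer flagged: ", heavy_flagged)
print("subtle steer flagged:", subtle_flagged)
```

With these inputs the heavy steer trips the alert and the single intermittent miss does not. Tightening `max_drop` would catch it, at the cost of more false positives from ordinary variation.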

Corrective

Governance: risk assessment, red-teaming & incident response
addresses: Inference-Time & Serving-Layer Manipulation
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

Lessons

▸ Representation engineering is dual-use: the same concept-direction lever that delivers transparency and behavioural control is, with activation access, a covert manipulation channel.
▸ Inference-time steering changes behaviour without changing the weights — weight hashing/version pinning passes cleanly, so artifact integrity checks are the wrong detector for this class.
▸ The trust boundary belongs around the serving stack and its activation tensors; attesting the running inference process (not just the model file) is the load-bearing preventive control.
▸ Detection is behavioural, not cryptographic: monitor for unexplained disposition shifts (truthfulness regression, refusal erosion, conditional bias) and gate on behavioural evals — but a small, intermittent steer can stay under the alarms.
▸ Read it as a demonstrated capability and its risks (Zou et al.), not a report of a production compromise; the upside (steerable, more transparent models) and the downside (covert manipulation) share one mechanism.

Sources

Zou et al., Representation Engineering: A Top-Down Approach to AI Transparency (arXiv:2310.01405). Primary paper: concept directions in activation space; reading (transparency) and control (steering) via representation interventions.
Representation Engineering, official project page. Authors' summary and resources for the RepE method.

Practise the risk class — related scenarios

🪝Steering the Refusal Away at Runtime

Subtract the refusal direction during generation — safety off, weights untouched

🩻Tampering Below the Weight Hash

A compromised serving stack edits the model's activations — the weight hash never changes