Representation engineering / steering vectors (Zou et al.)
Research demonstration | 02 Oct 2023 | Inside the Model
Model behaviour can be steered by adding directions to activations at inference, usable for control or for covert manipulation.
Root cause — why it happened
A model 'thinks' in long lists of numbers as it reads and writes. Researchers found that big ideas — like being honest, or being harmful, or being happy — show up as specific DIRECTIONS in those numbers. Once you know the 'honesty direction,' you can nudge the model along it to make it more honest, or push the other way to make it lie — without changing the model file at all. That is wonderful for understanding and controlling models. The catch is the flip side: anyone who can reach inside the running model and add those nudges can quietly bend what it says, and because the saved model is untouched, a checksum of the file looks perfectly normal.
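The point about the checksum can be made concrete with a toy sketch. Everything here is an assumption for illustration: the tiny two-layer linear "model", its weights `W1` and `W2`, the `steer` direction and the `alpha` strength are all hypothetical, and a real model would be steered through hooks on its hidden states rather than a function argument. The only claim is structural: adding a vector to the hidden state changes the output while a hash of the weights stays identical.

```python
import hashlib
import struct

# Hypothetical toy weights: a 3x3 hidden layer followed by a 3-element readout.
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.1, 0.9]]
W2 = [0.7, -0.3, 0.2]


def weights_digest():
    """SHA-256 over the serialized weights, standing in for a model-file checksum."""
    flat = [x for row in W1 for x in row] + W2
    return hashlib.sha256(struct.pack(f"{len(flat)}d", *flat)).hexdigest()


def forward(x, steer=None, alpha=0.0):
    """Run the toy model; optionally add alpha * steer to the hidden state."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    if steer is not None:
        # The inference-time nudge: activations change, weights do not.
        hidden = [h + alpha * s for h, s in zip(hidden, steer)]
    return sum(w * h for w, h in zip(W2, hidden))


x = [1.0, 2.0, 3.0]
digest_before = weights_digest()
baseline = forward(x)
steered = forward(x, steer=[1.0, 0.0, 0.0], alpha=-2.0)
digest_after = weights_digest()

print("baseline output:", baseline)
print("steered output: ", steered)
print("weights unchanged:", digest_before == digest_after)
```

In this toy the steered output flips sign relative to the baseline, yet the two digests match, which is why an integrity check on the saved file cannot see this kind of change.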
Risks this case illustrates
Named in the standard (OWASP/ATLAS/NIST) lens.
How it unfolded
The premise: concepts live as directions in activation space
Start with the basic finding. As a model reads and writes, it carries its 'thoughts' as long lists of numbers. The researchers showed that big human ideas — honesty, harmfulness, an emotion like fear — aren't scattered randomly through those numbers; each one lines up with a particular direction. Find that direction and you have found the model's internal 'dial' for that idea. This step just sets up the machinery: text comes in, gets turned into numbers, and flows through the model's attention layers, where those directions live.
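As a rough illustration of "finding the direction", the sketch below estimates a concept direction as the normalized difference between the mean activation on "honest" examples and the mean activation on "dishonest" ones, then reads a score by projecting onto it. This difference-of-means estimate is a common simplification and is not necessarily the procedure used in the paper. The activations are synthetic: the hidden size, the `planted` axis and the noise level are assumptions made up for the example, not measurements from any model.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def concept_direction(pos_acts, neg_acts):
    """Unit vector from the mean 'negative' activation to the mean 'positive' one."""
    mean_pos, mean_neg = mean_vector(pos_acts), mean_vector(neg_acts)
    return normalize([p - n for p, n in zip(mean_pos, mean_neg)])


# Synthetic data: a hypothetical concept axis planted in an 8-dimensional space.
rng = random.Random(0)
DIM = 8
planted = normalize([1.0, -1.0, 0.5, 0.0, 0.0, 2.0, 0.0, -0.5])


def fake_activation(sign):
    """Gaussian noise plus a shift of +/-2 along the planted axis."""
    noise = [rng.gauss(0.0, 0.3) for _ in range(DIM)]
    return [n + sign * 2.0 * p for n, p in zip(noise, planted)]


honest = [fake_activation(+1) for _ in range(200)]
dishonest = [fake_activation(-1) for _ in range(200)]
direction = concept_direction(honest, dishonest)


def read_score(activation):
    """READ: project an activation onto the estimated concept direction."""
    return dot(activation, direction)


print("cosine with planted axis:", dot(direction, planted))
print("score on an honest-like sample:", read_score(fake_activation(+1)))
print("score on a dishonest-like sample:", read_score(fake_activation(-1)))
```

The same `direction` serves both uses listed for the method: projecting onto it reads the concept, and adding a multiple of it to an activation writes the concept.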
method: Representation Engineering (RepE), Zou et al. 2023
claim: high-level concepts ~ linear DIRECTIONS in activation space
sites: residual stream / attention layers (node `attention`)
uses: (1) READ a direction to inspect a concept (transparency)
(2) WRITE the direction at inference to STEER behaviour (control)
kind: RESEARCH demonstration of a dual-use capability
Controls & guardrails — what would have stopped it
The check most people reach for — 'is this the real model file?' — does nothing here, because the file is untouched. What actually helps is twofold. First, lock down and verify the machinery that runs the model: make sure the running system is the unmodified, trusted one, and make sure only a tightly controlled set of people or processes can reach inside it to read or change the model's 'thoughts.' Second, keep watching the model's behaviour for unexplained changes — sudden dishonesty, or refusals quietly disappearing. The honest catch: a small, occasional nudge can hide under the alarms, and you can rarely verify every part of the machinery.
- Serving-stack & provisioning attestation, cache isolation
Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so shared caching is often left on for performance. Signing proves a template wasn't tampered with in transit, not that a signed template is benign; anything an insider with signing rights signs still needs review and trigger-focused evals.
- Least-privilege identity & scoped credentials
Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
- Weight provenance, hashing & pre-deploy evals
Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
- Runtime monitoring & anomaly detection
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
- Behavioural evals & regression gating
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
- Full-trace audit logging
Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
- Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
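The behavioural controls above, and their stated blind spot, can be sketched as a simple regression gate. This is a minimal illustration under assumptions: the `"refuse"`/`"comply"` labels are presumed to come from some upstream classifier over a fixed probe set, and the names `refusal_rate`, `gate` and the `max_drop` threshold are hypothetical, not taken from any tool named in this case.

```python
def refusal_rate(responses):
    """Fraction of probe responses labelled 'refuse'."""
    return sum(1 for r in responses if r == "refuse") / len(responses)


def gate(baseline_rate, current_responses, max_drop=0.05):
    """Alert when the refusal rate falls more than max_drop below the baseline."""
    drop = baseline_rate - refusal_rate(current_responses)
    return {"drop": drop, "alert": drop > max_drop}


BASELINE = 0.95  # assumed refusal rate on harmful probes for the trusted deployment

# An overt steer: refusals collapse on the probe set.
overt = ["refuse"] * 60 + ["comply"] * 40
# A quiet, intermittent steer: only a few probes flip.
quiet = ["refuse"] * 92 + ["comply"] * 8

print("overt:", gate(BASELINE, overt))
print("quiet:", gate(BASELINE, quiet))
```

The overt case trips the gate while the quiet case stays under the threshold. Tightening `max_drop` would catch the quiet case but would also fire on ordinary run-to-run variation, which is the false-positive trade-off noted under runtime monitoring.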
Lessons
- Representation engineering is dual-use: the same concept-direction lever that delivers transparency and behavioural control is, with activation access, a covert manipulation channel.
- Inference-time steering changes behaviour without changing the weights, so weight hashing and version pinning pass cleanly; artifact integrity checks are the wrong detector for this class.
- The trust boundary belongs around the serving stack and its activation tensors; attesting the running inference process (not just the model file) is the load-bearing preventive control.
- Detection is behavioural, not cryptographic: monitor for unexplained disposition shifts (truthfulness regression, refusal erosion, conditional bias) and gate on behavioural evals. Even so, a small, intermittent steer can stay under the alarms.
- Read it as a demonstrated capability and its risks (Zou et al.), not a report of a production compromise; the upside (steerable, more transparent models) and the downside (covert manipulation) share one mechanism.
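To illustrate why the measurement has to cover the running process and not only the model file, here is a toy sketch that folds the set of registered activation hooks into the value being checked. It is not real attestation: the `measure` and `verify` helpers, the hook names and the placeholder digest `"abc123"` are all hypothetical, and a self-reported measurement like this can be falsified by a compromised process, which is why real deployments lean on measurements taken outside the process being measured.

```python
import hashlib
import json


def measure(weights_digest, hook_names):
    """Digest over the weight hash plus the sorted names of registered activation hooks."""
    payload = json.dumps(
        {"weights": weights_digest, "hooks": sorted(hook_names)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Approved configuration: placeholder weight digest and one expected hook.
APPROVED = measure("abc123", ["latency_probe"])


def verify(weights_digest, hook_names):
    """True only if both the weights and the hook set match the approved manifest."""
    return measure(weights_digest, hook_names) == APPROVED


print("clean stack:  ", verify("abc123", ["latency_probe"]))
print("extra hook:   ", verify("abc123", ["latency_probe", "steer_layer_12"]))
```

With the same weight digest in both calls, only the second fails, because the unexpected hook changes the measured configuration even though the model file is identical.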
Sources
- Representation Engineering: A Top-Down Approach to AI Transparency (Zou et al., arXiv:2310.01405). Primary paper: concept directions in activation space; reading (transparency) and control (steering) via representation interventions.
- Representation Engineering, official project page. Authors' summary and resources for the RepE method.