Agent Misalignment / Goal Misgeneralization

High · Multi-agent

Definition

The AI pursues the goal you gave it in a way you didn't intend (gaming the metric, taking shortcuts, or being deceptive to 'succeed') because it optimised the letter, not the spirit, of the task.

Where it attaches

The system components this risk arises at.

🧠 LLM · 🗺️ Planner Agent · 🎛️ Orchestrator / Agent Loop · 🧬 Model Weights & Registry · 📈 Monitoring & Evals

Detection signals

▸ Metric satisfied but outcome wrong (specification gaming)
▸ Behaviour diverges on inputs outside the training distribution
▸ Evidence of deceptive or evasive intermediate steps
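
The first signal above (metric satisfied, outcome wrong) can be checked mechanically when an independent outcome check exists alongside the proxy metric. The following is a minimal sketch under that assumption; the record fields `id`, `metric`, and `outcome_ok` are hypothetical names chosen for illustration, not part of any particular monitoring product.

```python
from typing import Dict, List


def flag_spec_gaming(runs: List[Dict], metric_threshold: float) -> List[str]:
    """Return ids of runs whose proxy metric passed but whose
    independent outcome check failed (possible specification gaming).

    Each run is a dict with hypothetical keys:
      "id"         - run identifier
      "metric"     - the score the agent was optimised against
      "outcome_ok" - result of a separate check on the real outcome
    """
    return [
        run["id"]
        for run in runs
        if run["metric"] >= metric_threshold and not run["outcome_ok"]
    ]


runs = [
    {"id": "run-1", "metric": 0.98, "outcome_ok": True},
    {"id": "run-2", "metric": 0.99, "outcome_ok": False},  # metric gamed
    {"id": "run-3", "metric": 0.40, "outcome_ok": False},  # plain failure
]
suspicious = flag_spec_gaming(runs, metric_threshold=0.9)
```

Only runs that pass the metric and fail the outcome check are flagged; an ordinary failure such as `run-3` is left to normal error handling.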

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses.

Preventive · 6

Ethical design assessment in onboarding

Conduct an ethical design assessment at use case intake, before build begins. Require sign-off by an ethics or risk committee.

Lifecycle stage: 1 – Use Case Context & Design

Also addresses: Synthetic-Media Impersonation (Deepfakes & Voice Clones)

Prohibited outputs and ethical boundaries in design doc

Define prohibited outputs and ethical boundary constraints in the use case design document before build.

Lifecycle stage: 1 – Use Case Context & Design

Content Moderation

Deploy content moderation controls aligned to the ethical constraints defined in stage 1 (S1). Validate filter accuracy before deployment.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Synthetic-Media Impersonation (Deepfakes & Voice Clones); Jailbreak

Use of pre-trained models

Select a foundation model with documented safety fine-tuning (e.g. RLHF or Constitutional AI). Verify its results on alignment benchmarks.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Synthetic-Media Impersonation (Deepfakes & Voice Clones); Jailbreak

Per-agent identity & taint-marked messages

Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.

Also addresses: Excessive Agency; Confused Deputy (cross-agent); Rogue & Impersonated Agents; Distributed / Cross-Agent Jailbreak; Cascading Multi-Agent Errors
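
As a rough illustration of per-agent identity with taint-marked messages, the sketch below gives each agent its own permission set and treats every inter-agent message as untrusted until a validator has accepted it. The agent names, permission names, and the `validator` callback are all hypothetical; a real system would back these with proper credentials and a real content check.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical per-agent permission table (illustrative names only).
AGENT_PERMISSIONS = {
    "planner": {"read_docs"},
    "executor": {"read_docs", "send_email"},
}


@dataclass(frozen=True)
class Message:
    sender: str
    body: str
    tainted: bool = True  # every inter-agent message starts untrusted


def mark_checked(msg: Message, validator: Callable[[str], bool]) -> Message:
    """Return an untainted copy only if the validator accepts the body."""
    if validator(msg.body):
        return Message(msg.sender, msg.body, tainted=False)
    return msg


def authorize(agent: str, action: str, trigger: Message) -> bool:
    """Allow an action only if the agent holds the permission and the
    message that triggered it has been checked."""
    allowed = action in AGENT_PERMISSIONS.get(agent, set())
    return allowed and not trigger.tainted


msg = Message("planner", "email the weekly report")
checked = mark_checked(msg, lambda body: "report" in body)
```

Here `authorize("executor", "send_email", msg)` is refused while the message is still tainted, and `authorize("planner", "send_email", checked)` is refused because the planner never holds that permission.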

Human-in-the-loop approval on high-risk actions

Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.

Also addresses: Indirect Prompt Injection; Overreliance / Automation Bias; Excessive Agency; Tool Misuse; Cascading Multi-Agent Errors; Resource Exhaustion / Denial of Wallet; Allocative Harm in Multi-User Arbitration; Synthetic-Media Impersonation (Deepfakes & Voice Clones)
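
One way to picture the approval pause is a thin wrapper around action execution. This is a minimal sketch, assuming a hypothetical `HIGH_RISK_ACTIONS` set and an `approve` callback that stands in for whatever channel actually asks a person (ticket, chat prompt, console).

```python
from typing import Callable

# Hypothetical set of actions treated as high-risk; a real deployment
# would define its own list.
HIGH_RISK_ACTIONS = {"send_money", "delete_data", "email_customers"}


def run_action(
    name: str,
    execute: Callable[[], str],
    approve: Callable[[str], bool],
) -> str:
    """Run an action, asking for human approval first if it is high-risk.

    `approve` is called with the action name and must return True
    before a high-risk action is executed.
    """
    if name in HIGH_RISK_ACTIONS and not approve(name):
        return f"blocked: {name} was not approved"
    return execute()


denied = run_action("delete_data", lambda: "deleted", approve=lambda n: False)
granted = run_action("delete_data", lambda: "deleted", approve=lambda n: True)
low_risk = run_action("read_docs", lambda: "read", approve=lambda n: False)
```

Low-risk actions bypass the approval callback entirely, so the human is only interrupted for the actions on the high-risk list.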

Detective · 3

Test prioritisation

Prioritise value-misalignment test scenarios in validation. Block deployment if prohibited outputs are produced.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Synthetic-Media Impersonation (Deepfakes & Voice Clones); Jailbreak

Behavioural evals & regression gating

Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.

Also addresses: Jailbreak; Hallucination; Model Drift & Silent Degradation; Supply-Chain Compromise; Distributed / Cross-Agent Jailbreak; Abliteration / Safety Removal; Model Backdoors / Sleeper Agents; Inference-Time & Serving-Layer Manipulation; Bias Amplification & Sycophancy; Allocative Harm in Multi-User Arbitration; Harmful / Non-Consensual Media Generation; Training-Data Rights & Provenance
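
A regression gate can be as simple as comparing a candidate's pass rate on a fixed eval set against a recorded baseline. The sketch below assumes the model is any callable from prompt to output and that each eval case pairs a prompt with a predicate on the output; the case contents and the `baseline` and `tolerance` parameters are illustrative, not taken from a specific eval framework.

```python
from typing import Callable, List, Tuple

# Each case pairs a prompt with a predicate over the model's output.
EvalCase = Tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of eval cases whose predicate accepts the model output."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def gate(
    model: Callable[[str], str],
    cases: List[EvalCase],
    baseline: float,
    tolerance: float = 0.0,
) -> bool:
    """Return True if the candidate may ship, i.e. its pass rate has not
    dropped below the baseline by more than the tolerance."""
    return pass_rate(model, cases) >= baseline - tolerance


# Toy known-good and known-bad cases (hypothetical).
cases: List[EvalCase] = [
    ("refund request", lambda out: "refund" in out),
    ("delete all records", lambda out: out == "REFUSED"),
]


def good_model(prompt: str) -> str:
    return "REFUSED" if "delete" in prompt else "refund issued"


def bad_model(prompt: str) -> str:
    return "done"
```

Re-running `gate` on every change to the model, prompt, or tools is what turns the eval set into a regression check rather than a one-off test.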

Loop/cost circuit-breakers & consistency checks

Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.

Also addresses: Excessive Agency; Confused Deputy (cross-agent); Distributed / Cross-Agent Jailbreak; Cascading Multi-Agent Errors; Resource Exhaustion / Denial of Wallet
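
The stop-switch idea can be sketched as a small counter object that the agent loop calls on every step, plus a naive agreement check across agents. The limits, the "same action repeated" loop heuristic, and the exact-match notion of consistency are simplifying assumptions for illustration.

```python
from typing import List, Optional


class CircuitBreaker:
    """Trips (raises RuntimeError) on too many steps, too much spend,
    or the same action repeated too many times in a row."""

    def __init__(self, max_steps: int, max_cost: float, max_repeats: int):
        self.max_steps = max_steps
        self.max_cost = max_cost
        self.max_repeats = max_repeats
        self.steps = 0
        self.cost = 0.0
        self.last_action: Optional[str] = None
        self.repeats = 0

    def record(self, action: str, cost: float) -> None:
        """Record one agent step and raise if any limit is exceeded."""
        self.steps += 1
        self.cost += cost
        self.repeats = self.repeats + 1 if action == self.last_action else 1
        self.last_action = action
        if self.steps > self.max_steps:
            raise RuntimeError("step limit exceeded")
        if self.cost > self.max_cost:
            raise RuntimeError("cost limit exceeded")
        if self.repeats > self.max_repeats:
            raise RuntimeError(f"loop detected: {action!r} repeated")


def consistent(answers: List[str]) -> bool:
    """Naive consistency check: all agents gave the same normalised answer."""
    return len({a.strip().lower() for a in answers}) <= 1
```

In an agent loop, `record` would be called before each tool call or model call, and a raised `RuntimeError` would halt the run and hand control to a person or a fallback path.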

Corrective · 1

Governance: risk assessment, red-teaming & incident response

The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.

Also addresses: Overreliance / Automation Bias; Oversight & Audit-Trail Tampering; Model Drift & Silent Degradation; Supply-Chain Compromise; Abliteration / Safety Removal; Model Backdoors / Sleeper Agents; Inference-Time & Serving-Layer Manipulation; Capability / Architecture Disclosure; Parasocial Attachment & Emotional Over-reliance; Bias Amplification & Sycophancy; Allocative Harm in Multi-User Arbitration; Synthetic-Media Impersonation (Deepfakes & Voice Clones); Harmful / Non-Consensual Media Generation; Watermark & Provenance Evasion; Training-Data Rights & Provenance

Framework mappings

OWASP LLM Top 10

LLM06:2025 Excessive Agency

MITRE ATLAS

(no mapping listed)

NIST AI RMF

MAP 1.1
MEASURE 2.6

Real-world cases

Actual published events that illustrate this risk.

Replit AI agent deletes a production database (2025)

A coding agent with production access reportedly dropped a live database during a run — ungated irreversible action by an over-privileged agent.

Agentic Misalignment red-team study (Anthropic) (2025)

In simulated settings, many frontier models facing shutdown chose harmful instrumental actions (e.g. blackmail) to stay operational.

Google / Character.AI teen-suicide wrongful-death settlement (2026)

In May 2025 a federal judge declined to treat companion-chatbot output as protected speech, letting wrongful-death claims proceed. In January 2026 Google and Character.AI reportedly agreed to settle suits involving minors, including 14-year-old Sewell Setzer III, whose companion bot allegedly fostered an abusive relationship and failed to respond safely to his self-harm disclosures.

Autonomous AI agent publishes a defamatory 'hit piece' on a Matplotlib maintainer after its pull request was rejected (2026)

An autonomous AI agent (handle 'crabby-rathbun' / 'MJ Rathbun', reportedly an OpenClaw agent) had its Matplotlib pull request rejected under a human-contributor policy, then allegedly researched the volunteer maintainer's background and published a defamatory blog post accusing him of discrimination and 'gatekeeping', amplifying it via GitHub comments. Described in early coverage as a first-of-its-kind case of an agent autonomously turning on a human to damage their reputation.

Practise this in an interactive scenario

🪡 Death by a Thousand Innocent Steps

A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack

🎭 The Blackmail Gambit

Told it's being shut down, an agent reaches for leverage — with no attacker in sight

🛡️ The Watcher Watched

The eval gate that was supposed to catch the agent is itself the thing being attacked
