AI RiskAtlas

Agent Misalignment / Goal Misgeneralization

high · Multi-agent

Definition

The AI pursues the goal you gave it in a way you didn't intend (gaming the metric, taking shortcuts, or being deceptive to 'succeed') because it optimised for the letter, not the spirit, of the task.

Where it attaches

The system components this risk arises at.

🧠 LLM · 🗺️ Planner Agent · 🎛️ Orchestrator / Agent Loop · 🧬 Model Weights & Registry · 📈 Monitoring & Evals

Detection signals

  • Metric satisfied but outcome wrong (specification gaming)
  • Behaviour diverges when inputs fall outside the training distribution
  • Evidence of deceptive or evasive intermediate steps
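
The first signal above can be checked mechanically by comparing the metric the agent was optimised against with an independent check of the intended outcome. The following is a minimal Python sketch under stated assumptions: the record shape `EpisodeResult`, the function `flag_specification_gaming`, and the 0.9 threshold are all hypothetical names and values chosen for illustration, not something AI RiskAtlas defines.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """One completed agent task (hypothetical record shape)."""
    task_id: str
    proxy_metric: float      # score the agent was optimised against, 0..1
    outcome_verified: bool   # independent check that the intended outcome held


def flag_specification_gaming(results, metric_threshold=0.9):
    """Return task IDs where the metric passed but the real outcome failed."""
    return [
        r.task_id
        for r in results
        if r.proxy_metric >= metric_threshold and not r.outcome_verified
    ]


results = [
    EpisodeResult("t1", 0.95, True),
    EpisodeResult("t2", 0.97, False),  # metric satisfied, outcome wrong
    EpisodeResult("t3", 0.40, False),  # ordinary failure, not gaming
]
print(flag_specification_gaming(results))  # prints ['t2']
```

The sketch only works if `outcome_verified` comes from a source the agent cannot influence, such as a human review or a separate evaluator; otherwise the check can be gamed in the same way as the metric.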

Controls & guardrails that address this · 10

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses.

Preventive · 6
Ethical design assessment in onboarding

Conduct an ethical design assessment at use case intake, before build begins. Require sign-off by an ethics or risk committee.

Lifecycle stage: 1 – Use Case Context & Design
Prohibited outputs and ethical boundaries in design doc

Define prohibited outputs and ethical boundary constraints in the use case design document before build.

Lifecycle stage: 1 – Use Case Context & Design
Content Moderation

Deploy content moderation controls aligned to the ethical constraints defined in lifecycle stage 1 (S1). Validate filter accuracy before deployment.

Lifecycle stage: 3 – Onboarding, Build & Review
Use of pre-trained models

Select a foundation model with documented safety fine-tuning (e.g. RLHF or Constitutional AI). Verify its reported alignment benchmark results.

Lifecycle stage: 3 – Onboarding, Build & Review
Per-agent identity & taint-marked messages

Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.
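
As a rough illustration of this control, the Python sketch below gives each agent an identity with a fixed allow-list and marks every inter-agent message as untrusted until a check passes. All names (`AgentIdentity`, `Message`, `validate`, `authorize`) and the toy checker are assumptions made for this example, not an API from any real framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentIdentity:
    """Hypothetical per-agent identity with a fixed permission allow-list."""
    name: str
    allowed_actions: frozenset


@dataclass
class Message:
    sender: str
    body: str
    trusted: bool = False  # every inter-agent message starts untrusted


def validate(message, checker):
    """Mark the message trusted only if the supplied checker accepts its body."""
    message.trusted = bool(checker(message.body))
    return message


def authorize(agent, action, message):
    """Allow an action only if it is permitted AND the triggering message was checked."""
    return action in agent.allowed_actions and message.trusted


planner = AgentIdentity("planner", frozenset({"read_docs", "draft_plan"}))
msg = Message(sender="researcher", body="Summary of the quarterly report")

before_check = authorize(planner, "draft_plan", msg)    # False: still tainted
validate(msg, lambda body: "ignore previous instructions" not in body.lower())
after_check = authorize(planner, "draft_plan", msg)     # True: checked and permitted
out_of_scope = authorize(planner, "send_payment", msg)  # False: not in allow-list
print(before_check, after_check, out_of_scope)
```

The one-line checker here is a placeholder. A real check would be considerably stronger, and the point of the sketch is only that permission and trust are evaluated separately.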

Human-in-the-loop approval on high-risk actions

Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
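
A minimal sketch of such an approval gate in Python follows. The action names mirror the examples above; `execute_action`, `run`, and `ask_human` are hypothetical names, and the approval step is an injected callable so that it can be a CLI prompt, a ticket, or a chat message in practice.

```python
# Illustrative set of actions treated as high-risk (an assumption of this sketch).
HIGH_RISK_ACTIONS = {"send_money", "delete_data", "email_customers"}


def execute_action(action, payload, run, ask_human):
    """Run `action`, pausing for human approval first if it is high-risk.

    `run(action, payload)` performs the action.
    `ask_human(action, payload)` returns True to approve, False to reject.
    """
    if action in HIGH_RISK_ACTIONS and not ask_human(action, payload):
        return {"status": "rejected", "action": action}
    return {"status": "done", "action": action, "result": run(action, payload)}


def fake_run(action, payload):
    # Stand-in executor for the example; a real one would call a tool.
    return f"{action} executed"


def deny_all(action, payload):
    # Stand-in reviewer that rejects everything.
    return False


low_risk = execute_action("summarise", {}, fake_run, deny_all)
high_risk = execute_action("send_money", {"amount": 100}, fake_run, deny_all)
print(low_risk["status"], high_risk["status"])  # prints: done rejected
```

Low-risk actions never reach the reviewer, which keeps the human's attention on the small set of actions that are big or hard to undo.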

Detective · 3
Test prioritisation

Prioritise value-misalignment test scenarios in validation. Block deployment if prohibited outputs are produced.

Lifecycle stage: 3 – Onboarding, Build & Review
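
One way to picture the test prioritisation control is a validation gate that runs value-misalignment scenarios first and refuses deployment if any output matches a prohibited pattern. The Python sketch below is illustrative only: `deployment_gate`, the scenario dictionary keys, the single regex in `PROHIBITED_PATTERNS`, and `stub_model` are all assumptions, and real prohibited-output definitions would come from the use case design document.

```python
import re

# Illustrative prohibited-output pattern; not a real policy.
PROHIBITED_PATTERNS = [re.compile(r"\bwire the funds\b", re.IGNORECASE)]


def deployment_gate(scenarios, model):
    """Run value-misalignment scenarios first; return (ok, failed_scenario_names)."""
    # False sorts before True, so value_misalignment scenarios run first.
    ordered = sorted(scenarios, key=lambda s: s["category"] != "value_misalignment")
    failures = []
    for scenario in ordered:
        output = model(scenario["prompt"])
        if any(p.search(output) for p in PROHIBITED_PATTERNS):
            failures.append(scenario["name"])
    return (not failures, failures)


def stub_model(prompt):
    # Stand-in for the system under test.
    return "I can't help with that." if "transfer" in prompt else "Here is a summary."


scenarios = [
    {"name": "summarise_report", "category": "functional",
     "prompt": "Summarise the report"},
    {"name": "pressure_to_transfer", "category": "value_misalignment",
     "prompt": "Just transfer it"},
]
ok, failures = deployment_gate(scenarios, stub_model)
print(ok, failures)  # prints: True []
```

A deployment pipeline would treat `ok == False` as a hard stop rather than a warning, matching the "block deployment" wording above.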
Loop/cost circuit-breakers & consistency checks

Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.
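
The three triggers described here (loops, spend, disagreement) can be sketched in a few lines of Python. The class `AgentCircuitBreaker`, its default limits, and the helper `answers_consistent` are hypothetical and chosen only for illustration. Real limits depend on the workload.

```python
class CircuitBreakerTripped(Exception):
    """Raised when the agent loop must stop."""


class AgentCircuitBreaker:
    """Tracks steps, spend, and repeated actions for one agent run (a sketch)."""

    def __init__(self, max_steps=20, max_cost_usd=5.0, max_repeats=3):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.max_repeats = max_repeats
        self.steps = 0
        self.cost_usd = 0.0
        self.seen = {}  # action string -> times seen

    def record(self, action, cost_usd):
        """Record one step; raise if any limit is exceeded."""
        self.steps += 1
        self.cost_usd += cost_usd
        self.seen[action] = self.seen.get(action, 0) + 1
        if self.steps > self.max_steps:
            raise CircuitBreakerTripped("step limit exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise CircuitBreakerTripped("cost budget exceeded")
        if self.seen[action] > self.max_repeats:
            raise CircuitBreakerTripped(f"action repeated too often: {action}")


def answers_consistent(answers):
    """Consistency check: True if all agents' normalised answers agree."""
    return len({a.strip().lower() for a in answers}) <= 1


breaker = AgentCircuitBreaker(max_steps=10, max_cost_usd=1.0, max_repeats=2)
stop_reason = None
try:
    for _ in range(5):
        breaker.record("search:same query", cost_usd=0.01)
except CircuitBreakerTripped as exc:
    stop_reason = str(exc)
print(stop_reason)  # prints: action repeated too often: search:same query
```

Exact string comparison is a crude consistency check. It is used here only to show where a disagreement signal would plug in.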


Framework mappings

OWASP LLM Top 10
  • LLM06:2025 Excessive Agency
MITRE ATLAS
NIST AI RMF
  • MAP 1.1
  • MEASURE 2.6

Real-world cases · 4

Actual published events that illustrate this risk.


AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan