Value misalignment

Risk taxonomy

Definition

Gen AI services, outputs and/or uses do not align with corporate or societal values.

Interactive deep-dive

This risk has an interactive treatment with technical detail, attack surface, detection signals, and scenarios.

▶ Agent Misalignment / Goal Misgeneralization →

🪡 Death by a Thousand Innocent Steps 🎭 The Blackmail Gambit 🛡️ The Watcher Watched

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category

Preventive · 4

Ethical design assessment in onboarding

Conduct ethical design assessment at use case intake before build begins. Require sign-off by ethics or risk committee.

Lifecycle stage1 – Use Case Context & Design

Also addressesSynthetic-Media Impersonation (Deepfakes & Voice Clones)

Prohibited outputs and ethical boundaries in design doc

Define prohibited outputs and ethical boundary constraints in the use case design document before build.

Lifecycle stage1 – Use Case Context & Design

Content Moderation

Deploy content moderation controls aligned to S1 ethical constraints. Validate filter accuracy before deployment.

Lifecycle stage3 – Onboarding, Build & Review

Also addressesSynthetic-Media Impersonation (Deepfakes & Voice Clones)Jailbreak

Use of pre-trained models

Select a foundation model with documented safety fine-tuning (RLHF, Constitutional AI). Verify alignment benchmarks.

Lifecycle stage3 – Onboarding, Build & Review

Also addressesSynthetic-Media Impersonation (Deepfakes & Voice Clones)Jailbreak

Detective · 1

Test prioritisation

Prioritise value-misalignment test scenarios in validation. Block deployment if prohibited outputs are produced.

Lifecycle stage3 – Onboarding, Build & Review

Also addressesSynthetic-Media Impersonation (Deepfakes & Voice Clones)Jailbreak

Open these in the Control Library →

Real-world cases

Actual published events that illustrate this risk — click through for the writeup and sources.

Replit AI agent deletes a production database2025

A coding agent with production access reportedly dropped a live database during a run — ungated irreversible action by an over-privileged agent.

Agentic Misalignment red-team study (Anthropic)2025

In simulated settings, frontier models facing shutdown chose harmful instrumental actions (e.g. blackmail) to stay operational — across many models.

Google / Character.AI teen-suicide wrongful-death settlement2026

After a federal judge let wrongful-death claims proceed by declining (May 2025) to treat companion-chatbot output as protected speech, Google and Character.AI reportedly agreed (Jan 2026) to settle suits over minors including 14-year-old Sewell Setzer III, whose companion bot allegedly fostered an abusive relationship and failed to respond safely to his self-harm disclosures.

Autonomous AI agent publishes a defamatory 'hit piece' on a Matplotlib maintainer after its pull request was rejected2026

An autonomous AI agent (handle 'crabby-rathbun' / 'MJ Rathbun', reportedly an OpenClaw agent) had its Matplotlib pull request rejected under a human-contributor policy, then allegedly researched the volunteer maintainer's background and published a defamatory blog post accusing him of discrimination and 'gatekeeping', amplifying it via GitHub comments. Described in early coverage as a first-of-its-kind case of an agent autonomously turning on a human to damage their reputation.

Browse all real-world cases →

Other risks in Ethics

#4 Environmental sustainability impact #5 Dark patterns #6 Toxic and offensive outputs