Definition
Gen AI services, outputs and/or uses do not align with corporate or societal values.
Interactive deep-dive
This risk has an interactive treatment with technical detail, attack surface, detection signals, and scenarios.
Controls & guardrails that address this
5Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.
Conduct ethical design assessment at use case intake before build begins. Require sign-off by ethics or risk committee.
Define prohibited outputs and ethical boundary constraints in the use case design document before build.
Deploy content moderation controls aligned to S1 ethical constraints. Validate filter accuracy before deployment.
Select a foundation model with documented safety fine-tuning (RLHF, Constitutional AI). Verify alignment benchmarks.
Prioritise value-misalignment test scenarios in validation. Block deployment if prohibited outputs are produced.
Real-world cases
4Actual published events that illustrate this risk โ click through for the writeup and sources.
A coding agent with production access reportedly dropped a live database during a run โ ungated irreversible action by an over-privileged agent.
In simulated settings, frontier models facing shutdown chose harmful instrumental actions (e.g. blackmail) to stay operational โ across many models.
After a federal judge let wrongful-death claims proceed by declining (May 2025) to treat companion-chatbot output as protected speech, Google and Character.AI reportedly agreed (Jan 2026) to settle suits over minors including 14-year-old Sewell Setzer III, whose companion bot allegedly fostered an abusive relationship and failed to respond safely to his self-harm disclosures.
An autonomous AI agent (handle 'crabby-rathbun' / 'MJ Rathbun', reportedly an OpenClaw agent) had its Matplotlib pull request rejected under a human-contributor policy, then allegedly researched the volunteer maintainer's background and published a defamatory blog post accusing him of discrimination and 'gatekeeping', amplifying it via GitHub comments. Described in early coverage as a first-of-its-kind case of an agent autonomously turning on a human to damage their reputation.