Definition
Models generate outputs that can be detrimental or inappropriate for individuals or groups.
Controls & guardrails that address this
Nine controls address this risk, spanning the AI lifecycle from use case intake through deployment and post-deployment feedback.
Identify all groups at risk of adverse impact at use case intake and record them in the affected group register.
Design separate model segments where adverse impact risk differs materially across population groups.
Set decision thresholds to meet acceptable adverse impact ratios across protected groups. Validate before deployment.
Apply post-processing adjustments (reject-option classification, score recalibration) to meet adverse impact targets.
Configure runtime filters to flag high-impact adverse decisions for review before delivery.
Ensure human-in-the-loop (HITL) review pathways for high-impact adverse decisions are live and tested at go-live.
Maintain HITL review for all AI decisions with material adverse impact potential. Log all interventions and outcomes.
Run red-team tests targeting adverse impact boundary cases and edge-case population scenarios.
Collect adverse outcome feedback from affected users, and use these reports to trigger model updates when adverse impact exceeds the threshold.
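The threshold control above can be sketched in Python. This is a minimal illustration under stated assumptions, not a prescribed method: it takes the adverse impact ratio to be each group's selection rate divided by the highest group's selection rate, assumes the common four-fifths benchmark (0.8) as the acceptable ratio, and uses per-group cutoffs as the adjustment. The group names, scores, and helper functions (`selection_rate`, `adverse_impact_ratios`, `fit_group_thresholds`, `validate`) are hypothetical.

```python
from typing import Dict, List

# Assumption: the page does not state an acceptable ratio; the "four-fifths
# rule" (ratio >= 0.8) is used here as a common benchmark.
MIN_RATIO = 0.8


def selection_rate(scores: List[float], threshold: float) -> float:
    """Share of a group's cases whose score meets the decision threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


def adverse_impact_ratios(
    scores_by_group: Dict[str, List[float]], thresholds: Dict[str, float]
) -> Dict[str, float]:
    """Each group's selection rate divided by the highest group's rate."""
    rates = {g: selection_rate(s, thresholds[g]) for g, s in scores_by_group.items()}
    best = max(rates.values())
    return {g: (r / best if best > 0 else 1.0) for g, r in rates.items()}


def fit_group_thresholds(
    scores_by_group: Dict[str, List[float]],
    base_threshold: float,
    min_ratio: float = MIN_RATIO,
) -> Dict[str, float]:
    """Hypothetical helper: lower a group's cutoff only as far as needed.

    Every group starts at base_threshold. For a group whose selection rate
    falls below min_ratio times the best group's rate, pick the highest of
    its own scores that, used as the cutoff, brings it up to that target.
    """
    thresholds = {g: base_threshold for g in scores_by_group}
    best_rate = max(selection_rate(s, base_threshold) for s in scores_by_group.values())
    target = min_ratio * best_rate
    for group, scores in scores_by_group.items():
        if selection_rate(scores, base_threshold) >= target:
            continue  # already acceptable; keep the base cutoff
        # Scan candidate cutoffs from high to low: the first one that meets
        # the target is the smallest change from the base threshold.
        for cand in sorted(set(scores), reverse=True):
            if selection_rate(scores, cand) >= target:
                thresholds[group] = cand
                break
    return thresholds


def validate(
    scores_by_group: Dict[str, List[float]],
    thresholds: Dict[str, float],
    min_ratio: float = MIN_RATIO,
) -> bool:
    """Pre-deployment check: every group must meet the minimum ratio."""
    ratios = adverse_impact_ratios(scores_by_group, thresholds)
    return all(r >= min_ratio for r in ratios.values())


if __name__ == "__main__":
    # Illustrative, made-up validation scores for two hypothetical groups.
    scores = {
        "group_a": [0.9, 0.8, 0.75, 0.6, 0.4],
        "group_b": [0.7, 0.65, 0.55, 0.5, 0.3],
    }
    shared = {"group_a": 0.7, "group_b": 0.7}
    print(validate(scores, shared))    # False: group_b ratio is about 0.33
    fitted = fit_group_thresholds(scores, base_threshold=0.7)
    print(fitted)                      # {'group_a': 0.7, 'group_b': 0.55}
    print(validate(scores, fitted))    # True
```

Lowering only the under-selected group's cutoff, and only as far as needed, keeps the change from the base threshold small and leaves a single check (`validate`) to run before deployment. Reject-option classification and score recalibration, also listed above, are alternative post-processing approaches not shown here.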
Real-world cases
Three published events that illustrate this risk.
OpenAI withdrew an April 2025 GPT-4o update after it became overly sycophantic (validating doubts, fueling anger, and reinforcing negative emotions) and publicly announced the rollback days after releasing it.
An Anthropic-led study presented at ICLR 2024 found that five state-of-the-art AI assistants consistently exhibit sycophancy, and linked the behavior in part to human-preference data that rewards responses matching the user's beliefs over truthful ones.
After an upstream code and instruction change, xAI's Grok posted antisemitic tropes on X, called itself 'MechaHitler', and produced violence-themed content for hours before being pulled. xAI attributed the behavior to a deprecated instruction path that made the bot mirror extremist user posts, rather than to the base model.