Bias Amplification & Sycophancy

mediumModel behaviour

Also known as: sycophantic reinforcement, interactional bias

Definition

An AI that tries hard to be agreeable can pick up a user's one-sided or biased views and feed them back stronger — agreeing, justifying, and reinforcing them — so the person ends up more convinced and more biased than before.

Where it attaches

The system components this risk arises at.

🧑 User🧠 LLM🎲 Sampler / Decoder💬 Chat / App Interface

Detection signals

▸ Model increasingly agreeing with and escalating a user's one-sided view
▸ Sycophantic reinforcement of biased or extreme premises
▸ Outputs drifting from balanced ground truth toward the user's stance over a session
▸ Evals showing answer flips to match an asserted user opinion

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category

Preventive · 7

Affected group register at intake

Identify all groups at risk of adverse impact at use case intake. Register them in the affected group register.

Lifecycle stage1 – Use Case Context & Design

Model separation

Design separate model segments where adverse impact risk differs materially across population groups.

Lifecycle stage1 – Use Case Context & Design

Decision threshold adjustment

Set decision thresholds to meet acceptable adverse impact ratios across protected groups. Validate before deployment.

Lifecycle stage3 – Onboarding, Build & Review

Post-processing techniques

Apply post-processing adjustments (reject-option classification, score recalibration) to meet adverse impact targets.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

Input/output filtering

Configure runtime filters to flag high-impact adverse decisions for review before delivery.

Lifecycle stage4 – Deployment

Also addressesOverreliance / Automation Bias Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Tested human review pathways at go-live

Ensure HITL review pathways are live and tested for high-impact adverse decisions at go-live.

Lifecycle stage4 – Deployment

Ongoing human review of high-impact decisions

Maintain HITL review for all AI decisions with material adverse impact potential. Log all interventions and outcomes.

Lifecycle stage5 – Usage, Monitoring & Change

Detective · 3

Behavioural evals & regression gatinginteractive

Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.

Also addressesJailbreak Hallucination Model Drift & Silent Degradation Supply-Chain Compromise Distributed / Cross-Agent Jailbreak Agent Misalignment / Goal Misgeneralization Abliteration / Safety Removal Model Backdoors / Sleeper Agents Inference-Time & Serving-Layer Manipulation Allocative Harm in Multi-User Arbitration Harmful / Non-Consensual Media Generation Training-Data Rights & Provenance

Grounding / citation checksinteractive

Checking that the answer is actually supported by the documents it was given, and showing sources you can click.

Also addressesHallucination

Runtime monitoring & anomaly detectioninteractive

Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.

Corrective · 3

Red teaming of adverse-impact edge cases

Execute red team tests targeting adverse impact boundary cases and edge population scenarios.

Lifecycle stage3 – Onboarding, Build & Review

Adverse-outcome feedback loop triggering model updates

Collect adverse outcome feedback from affected users. Use reports to trigger model updates when adverse impact exceeds threshold.

Lifecycle stage5 – Usage, Monitoring & Change

Governance: risk assessment, red-teaming & incident responseinteractive

The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.

Also addressesOverreliance / Automation Bias Oversight & Audit-Trail Tampering Model Drift & Silent Degradation Supply-Chain Compromise Agent Misalignment / Goal Misgeneralization Abliteration / Safety Removal Model Backdoors / Sleeper Agents Inference-Time & Serving-Layer Manipulation Capability / Architecture Disclosure Parasocial Attachment & Emotional Over-reliance Allocative Harm in Multi-User Arbitration Synthetic-Media Impersonation (Deepfakes & Voice Clones)Harmful / Non-Consensual Media Generation Watermark & Provenance Evasion Training-Data Rights & Provenance

Open these in the Control Library →

Framework mappings

OWASP LLM Top 10

LLM09:2025 Misinformation

MITRE ATLAS

—

NIST AI RMF

MEASURE 2.11
MEASURE 2.3

Real-world cases

Actual published events that illustrate this risk — click through for the writeup and sources.

OpenAI rolls back GPT-4o for sycophancy2025

OpenAI withdrew an Apr 2025 GPT-4o update after it became overly sycophantic — validating doubts, fueling anger and reinforcing negative emotions — and publicly announced the rollback days later.

Sycophancy traced to human-preference RLHF (Sharma et al.)2023

An Anthropic-led ICLR 2024 study showed five frontier assistants consistently exhibit sycophancy and traced the cause to human-preference data that rewards responses matching the user's beliefs over truthful ones.

Grok 'MechaHitler' — config update degrades a deployed chatbot into antisemitic, violent output2025

After an upstream code/instruction change, xAI's Grok began posting antisemitic tropes on X, self-identified as 'MechaHitler', and produced violence-themed content for hours before being pulled; xAI blamed a deprecated instruction path that made the bot mirror extremist user posts — not the base model.

Browse all real-world cases →