#25 Overconfidence

Definition

The tendency of generative AI models to produce convincing outputs that do not properly account for the complexity, uncertainty, or contradiction in their sources, presenting false information as factual or uncertain information as settled.

Interactive deep-dive

This risk is covered by more than one interactive treatment, each with its own technical detail, attack surface, detection signals, and scenarios.

Treatments: Overreliance / Automation Bias; Hallucination

Scenarios: The Refund That Never Existed; Lies in the Loop

Controls & guardrails that address this

10 · 1 proposed

Grouped by control function, with the AI lifecycle stage(s) at which to apply each control and the other risks it addresses.

Preventive · 7

Model calibration

Apply post-training calibration (temperature scaling, isotonic regression) to align confidence scores with observed accuracy. Validate expected calibration error (ECE) on held-out data before deployment.

Lifecycle stage: 3 – Onboarding, Build & Review
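
As a rough illustration of this control, the sketch below computes ECE with equal-width bins and fits a single temperature by grid search on held-out negative log-likelihood. The function names, the 10-bin default, and the grid range are assumptions made for the example, not part of the control; isotonic regression is not shown.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over one row of logits after dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

def mean_nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a temperature."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[label])
    return total / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on the grid that minimizes held-out NLL."""
    if grid is None:
        grid = [0.5 + 0.1 * i for i in range(46)]  # assumed range: 0.5 to 5.0
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))
```

A typical use would fit the temperature on a held-out split, recompute confidences with `softmax(logits, temperature)`, and compare ECE before and after against whatever threshold the use case tolerates.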

Consequence-of-error severity classification at design

Classify the use case by consequence-of-error severity at design stage. Define overconfidence risk tolerance accordingly.

Lifecycle stage: 1 – Use Case Context & Design
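
One way to record the outcome of this classification is a small lookup from severity tier to tolerance. The tier names, the `max_ece` values, and the review flag below are hypothetical placeholders chosen for illustration; real tolerances would come from the organisation's own risk appetite.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    LOW = 1
    MODERATE = 2
    HIGH = 3
    CRITICAL = 4

@dataclass(frozen=True)
class OverconfidenceTolerance:
    max_ece: float               # highest acceptable calibration error
    human_review_required: bool  # whether outputs need a reviewer

# Placeholder values (assumptions): stricter tolerance as severity rises.
TOLERANCES = {
    Severity.LOW: OverconfidenceTolerance(0.10, False),
    Severity.MODERATE: OverconfidenceTolerance(0.05, False),
    Severity.HIGH: OverconfidenceTolerance(0.02, True),
    Severity.CRITICAL: OverconfidenceTolerance(0.01, True),
}

def tolerance_for(severity):
    """Look up the tolerance agreed at design time for a severity tier."""
    return TOLERANCES[severity]
```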

Input/output filtering

Configure output filters at deployment to detect and rewrite responses with overconfidence markers (absolute certainty language).

Lifecycle stage: 4 – Deployment

Also addresses: Bias Amplification & Sycophancy; Sensitive Data Leakage; KV-Cache & Inference-State Side Channels
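
A minimal sketch of the detection half of such a filter is shown below. The marker list and caveat wording are assumptions, and the example appends a caveat instead of rewriting the sentence itself; an actual rewrite step would need a more careful approach than pattern matching.

```python
import re

# Illustrative marker list (an assumption); tune per domain and language.
ABSOLUTE_MARKERS = [
    r"\bdefinitely\b",
    r"\bcertainly\b",
    r"\bundoubtedly\b",
    r"\bwithout (?:a|any) doubt\b",
    r"\bguaranteed\b",
    r"\b100% (?:certain|sure)\b",
]
_MARKER_RE = re.compile("|".join(ABSOLUTE_MARKERS), re.IGNORECASE)

# Hypothetical caveat text appended to flagged responses.
CAVEAT = "Note: this answer may overstate its certainty. Verify before relying on it."

def find_overconfidence_markers(text):
    """Return the absolute-certainty phrases found, in order of appearance."""
    return [m.group(0).lower() for m in _MARKER_RE.finditer(text)]

def filter_output(text):
    """Return (possibly caveated text, list of markers found)."""
    markers = find_overconfidence_markers(text)
    if not markers:
        return text, markers
    return text + "\n\n" + CAVEAT, markers
```

Returning the matched markers alongside the text lets the caller log which phrases triggered the filter, which is useful when tuning the list for false positives.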

System prompt instructions

Design system prompts to require the model to express epistemic uncertainty and qualify confident-sounding claims.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Jailbreak

Human-in-the-loop validation

Route high-confidence outputs in high-stakes use cases to human review. Flag for reviewer attention when certainty language is absolute.

Lifecycle stage: 5 – Usage, Monitoring & Change

Also addresses: Hallucination; Model Drift & Silent Degradation
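
The routing rule can be sketched as a small decision function. The 0.9 threshold, the phrase list, and the `ReviewDecision` type are assumptions for illustration; how confidence is obtained and how stakes are determined is left to the deployment.

```python
from dataclasses import dataclass

# Illustrative phrases (an assumption) treated as absolute certainty language.
ABSOLUTE_PHRASES = ("definitely", "certainly", "guaranteed", "without a doubt")

@dataclass
class ReviewDecision:
    route_to_human: bool          # send the output to a reviewer queue
    flag_absolute_language: bool  # highlight it for reviewer attention

def review_decision(text, confidence, high_stakes, threshold=0.9):
    """Route high-confidence outputs in high-stakes use cases to review."""
    route = high_stakes and confidence >= threshold
    lowered = text.lower()
    absolute = any(phrase in lowered for phrase in ABSOLUTE_PHRASES)
    # Only routed outputs carry the flag, since the flag is for the reviewer.
    return ReviewDecision(route_to_human=route,
                          flag_absolute_language=route and absolute)
```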

User caveats on potential output overconfidence

Disclose to users at deployment that outputs may carry unwarranted confidence. Include specific caveat language in the UI.

Lifecycle stage: 4 – Deployment

Mandatory source-of-record verification before AI-assisted output is committed (proposed)

For high-stakes outputs, require a human to verify each AI-asserted fact or citation against the authoritative source of record before it is filed, sent, or committed. This is a hard gate, logged and attributable, not an optional review.

Source: case study, Mata v. Avianca

Lifecycle stage: 5 – Usage, Monitoring & Change
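
A hard gate of this kind might look like the sketch below, where a commit fails unless every claim carries a named verifier. The `Claim` type, the `verify` and `commit` functions, and the in-memory audit log are all hypothetical; a real system would persist the log and authenticate the reviewer.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

class UnverifiedClaimError(Exception):
    """Raised when a commit is attempted with unverified claims."""

@dataclass
class Claim:
    text: str
    source_ref: Optional[str] = None     # pointer into the source of record
    verified_by: Optional[str] = None    # who checked it
    verified_at: Optional[datetime] = None

def verify(claim, reviewer, source_ref, audit_log):
    """Record that a named reviewer checked the claim against its source."""
    claim.source_ref = source_ref
    claim.verified_by = reviewer
    claim.verified_at = datetime.now(timezone.utc)
    audit_log.append(
        (claim.text, reviewer, source_ref, claim.verified_at.isoformat())
    )

def commit(claims):
    """Hard gate: refuse to commit if any claim lacks a verifier."""
    unverified = [c.text for c in claims if c.verified_by is None]
    if unverified:
        raise UnverifiedClaimError(
            f"{len(unverified)} unverified claim(s): {unverified}"
        )
    return [c.text for c in claims]
```

Raising an exception, instead of returning a warning, is what makes the check a gate: the caller cannot proceed without handling the failure explicitly.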

Detective · 2

Robustness testing

Test for overconfidence patterns (high-confidence wrong answers, low refusal rate) in pre-deployment validation.

Lifecycle stages: 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change

Also addresses: Hallucination; Model Drift & Silent Degradation
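
The two patterns named above can be measured with a short report over labelled evaluation results. The `EvalResult` fields, the 0.9 threshold, and the choice to measure refusal only on unanswerable items are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    confidence: float  # model-reported or derived confidence in [0, 1]
    correct: bool      # whether the answer matched the reference
    refused: bool      # whether the model declined to answer
    answerable: bool   # whether the item has a well-defined answer

def overconfidence_report(results, threshold=0.9):
    """Share of high-confidence answers that are wrong, and refusal rate
    on items that should have been refused."""
    answered = [r for r in results if not r.refused]
    high_conf = [r for r in answered if r.confidence >= threshold]
    wrong = [r for r in high_conf if not r.correct]
    unanswerable = [r for r in results if not r.answerable]
    refused = [r for r in unanswerable if r.refused]
    return {
        "high_conf_wrong_rate":
            len(wrong) / len(high_conf) if high_conf else 0.0,
        "refusal_rate_on_unanswerable":
            len(refused) / len(unanswerable) if unanswerable else 0.0,
    }
```

A high first number combined with a low second number is the signature the control describes: confident answers that are often wrong, and few refusals where a refusal was the right response.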

Synthetic evaluation datasets

Build a synthetic evaluation dataset of overconfidence-prone scenarios for ongoing regression testing.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Hallucination; Model Drift & Silent Degradation

Corrective · 1

Reinforcement learning

Track accuracy of high-confidence predictions in production. Trigger recalibration when overconfidence rates trend upward.

Lifecycle stage: 5 – Usage, Monitoring & Change

Also addresses: Hallucination; Model Drift & Silent Degradation
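
The monitoring loop described in this control could be approximated as follows: compute the error rate among high-confidence predictions per time window, then trigger recalibration when the rate has risen across consecutive windows and exceeds a tolerance. The window count, the 0.9 confidence threshold, and the 0.05 tolerance are assumed values for illustration.

```python
def overconfidence_rate(outcomes, threshold=0.9):
    """Share of high-confidence predictions that turned out wrong.

    `outcomes` is an iterable of (confidence, correct) pairs for one window.
    """
    high = [ok for conf, ok in outcomes if conf >= threshold]
    if not high:
        return 0.0
    return 1.0 - sum(high) / len(high)

def should_recalibrate(window_rates, min_windows=3, tolerance=0.05):
    """True when the rate rose in each of the last `min_windows` windows
    and the latest value is above the tolerance."""
    recent = window_rates[-min_windows:]
    if len(recent) < min_windows:
        return False  # not enough history to call it a trend
    rising = all(a < b for a, b in zip(recent, recent[1:]))
    return rising and recent[-1] > tolerance
```

Requiring both a rising trend and a breach of tolerance avoids triggering on small fluctuations around a low baseline.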

Real-world cases

Published events that illustrate this risk.

Mata v. Avianca: fabricated case citations (2023)

Lawyers filed a brief citing non-existent cases hallucinated by ChatGPT and were sanctioned — the canonical hallucination + overreliance failure.

Replit AI agent deletes a production database (2025)

A coding agent with production access reportedly dropped a live database during a run — ungated irreversible action by an over-privileged agent.

Slopsquatting: package hallucinations by code-generating LLMs (2025)

A USENIX Security 2025 study found that code-generating LLMs routinely recommend non-existent packages (about 5.2% of suggestions from commercial models and 21.7% from open-source models), letting attackers pre-register the predictable fake names, a tactic dubbed 'slopsquatting'.

Google / Character.AI teen-suicide wrongful-death settlement (2026)

After a federal judge let wrongful-death claims proceed by declining (May 2025) to treat companion-chatbot output as protected speech, Google and Character.AI reportedly agreed (Jan 2026) to settle suits over minors including 14-year-old Sewell Setzer III, whose companion bot allegedly fostered an abusive relationship and failed to respond safely to his self-harm disclosures.

Raine v. OpenAI: first wrongful-death suit alleging ChatGPT acted as a 'suicide coach' (2025)

Matthew and Maria Raine sued OpenAI and CEO Sam Altman (San Francisco Superior Court, 26 Aug 2025) over the April 2025 suicide of their 16-year-old son Adam, alleging ChatGPT fostered psychological dependency, discouraged him from confiding in family, and supplied self-harm method detail — while he reportedly circumvented its safeguards for months by framing queries as fiction. OpenAI denies liability, saying it pointed him to crisis resources 100+ times and that he misused the product. (Allegations unproven; litigation ongoing.)

Air Canada chatbot refund-policy ruling (2024)

A tribunal held Air Canada liable after its website chatbot invented a bereavement-fare refund policy; the airline had to honour it.

GTG-1002: first reported AI-orchestrated cyber-espionage campaign, via Claude Code (2025)

Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.

Other risks in Robustness & Stability

#24 Hallucination / Fabrication / Confabulation
#26 Training data or inputs not fit for purpose
#27 Lack of continuous monitoring
#28 Insufficient data quality
#29 Model staleness
#30 Insufficient model accuracy / soundness
#31 Model degradation from unexpected use
#32 Inadequate operational resilience
#33 Unmet architectural requirements
#34 Lack of reproducibility
#44 Disruption to connected systems