#27

Lack of continuous monitoring

Risk taxonomy

Definition

The absence of ongoing, systematic monitoring of how Gen AI systems perform and are used. Without it, there is no way to confirm that they remain aligned with their intended purposes, ethical guidelines and regulatory requirements.

Interactive deep-dive

This risk has an interactive treatment with technical detail, attack surface, detection signals, and scenarios.

▶ Model Drift & Silent Degradation →

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses.

Preventive · 2

Risk-tiered minimum monitoring requirements at design

Define minimum monitoring requirements at design stage calibrated to the use case risk tier.

Lifecycle stage: 1 – Use Case Context & Design
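
One way to make the design-stage requirement checkable is to record the minimum monitoring per risk tier as data and compare a use case's monitoring plan against it. The sketch below is illustrative only: the tier names, metric names, review intervals and the `missing_metrics` helper are all assumptions, not values defined by this catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringRequirement:
    """Minimum monitoring for one risk tier (all values are illustrative)."""
    metrics: tuple[str, ...]          # metrics that must be collected
    review_interval_days: int         # how often results must be reviewed
    human_review_sample_rate: float   # share of outputs sampled for human review


# Hypothetical tiers: a real programme would define its own names and values.
MINIMUM_MONITORING = {
    "low": MonitoringRequirement(("error_rate",), 90, 0.0),
    "medium": MonitoringRequirement(("error_rate", "eval_score"), 30, 0.01),
    "high": MonitoringRequirement(
        ("error_rate", "eval_score", "refusal_rate"), 7, 0.05
    ),
}


def missing_metrics(tier: str, planned_metrics: set[str]) -> set[str]:
    """Return the required metrics that a use case's plan does not yet cover."""
    required = set(MINIMUM_MONITORING[tier].metrics)
    return required - planned_metrics
```

A design review could call `missing_metrics("high", plan)` and block sign-off until the result is empty.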

Programmable conversation controls

Configure monitoring hooks in the conversation layer at deployment to capture the metrics required by the stage 1 (S1) monitoring requirements.

Lifecycle stage: 4 – Deployment

Also addresses: Hallucination · Model Drift & Silent Degradation
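
As a rough illustration of a monitoring hook in the conversation layer, the sketch below wraps a text-generation callable and records a few per-turn metrics. The `ConversationMonitor` class, the `generate` callable and the recorded fields are hypothetical; a real deployment would use whatever hook mechanism its conversation framework provides and ship the records to a metrics store rather than a list.

```python
import time
from typing import Callable


class ConversationMonitor:
    """Wraps a generate(prompt) -> reply callable and records per-turn metrics."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate        # hypothetical model call
        self.records: list[dict] = []    # in-memory stand-in for a metrics sink

    def __call__(self, prompt: str) -> str:
        start = time.perf_counter()
        reply = self._generate(prompt)
        self.records.append({
            "latency_s": time.perf_counter() - start,
            "prompt_chars": len(prompt),
            "reply_chars": len(reply),
            "empty_reply": not reply.strip(),
        })
        return reply
```

Because the wrapper has the same call signature as the function it wraps, it can be dropped in without changing callers, for example `chat = ConversationMonitor(model_call)`.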

Detective · 2

Synthetic evaluation datasets

Construct synthetic evaluation datasets during build to serve as the ongoing monitoring baseline.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Hallucination · Overreliance / Automation Bias
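
To show how a fixed evaluation dataset can act as a monitoring baseline, the sketch below scores a model on the same prompts at build time and again later, then flags a drop beyond a tolerance. The dataset contents, the exact-match scoring and the 0.05 tolerance are assumptions chosen for brevity; real evaluation sets are larger and usually need task-specific scoring.

```python
from typing import Callable

# Hypothetical synthetic evaluation set: (prompt, expected answer) pairs.
EVAL_SET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def accuracy(model: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Fraction of prompts whose reply exactly matches the expected answer."""
    correct = sum(
        1 for prompt, expected in dataset if model(prompt).strip() == expected
    )
    return correct / len(dataset)


def has_regressed(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when the current score falls more than `tolerance` below the baseline."""
    return (baseline - current) > tolerance
```

The baseline score is recorded once during build; each later monitoring run recomputes `accuracy` on the same dataset and passes both values to `has_regressed`.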

Robustness testing

Stand up the monitoring infrastructure during the build stage: performance metrics collection, alerting thresholds, and dashboards.

Lifecycle stages: 3 – Onboarding, Build & Review · 4 – Deployment · 5 – Usage, Monitoring & Change

Also addresses: Hallucination · Overreliance / Automation Bias · Model Drift & Silent Degradation
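
An alerting threshold can be as simple as a rolling mean compared against a floor. The sketch below is a minimal version under stated assumptions: the `RollingAlert` class, the window size and the threshold are hypothetical, and production systems would normally rely on their monitoring platform's alert rules instead.

```python
from collections import deque


class RollingAlert:
    """Fires when the mean of the last `window` observations drops below `threshold`."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.values: deque[float] = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record one metric value; return True if the alert condition holds."""
        self.values.append(value)
        # Wait for a full window so a single early reading cannot trigger an alert.
        if len(self.values) < self.values.maxlen:
            return False
        return sum(self.values) / len(self.values) < self.threshold
```

Averaging over a window trades detection speed for fewer false alarms; a shorter window reacts faster but is noisier.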

Real-world cases

Published incidents and studies that illustrate this risk.

'How Is ChatGPT's Behavior Changing over Time?' (Chen, Zaharia, Zou), 2023

Measured large swings in task performance between the March 2023 and June 2023 snapshots of GPT-4 and GPT-3.5, which is evidence of silent drift in a deployed service.

Grok 'MechaHitler': config update degrades a deployed chatbot into antisemitic, violent output, 2025

After an upstream code/instruction change, xAI's Grok began posting antisemitic tropes on X, identified itself as 'MechaHitler', and produced violence-themed content for hours before being pulled. xAI attributed the behavior to a deprecated instruction path that made the bot mirror extremist user posts, not to the base model.

Other risks in Robustness & Stability

#24 Hallucination / Fabrication / Confabulation
#25 Overconfidence
#26 Training data or inputs not fit for purpose
#28 Insufficient data quality
#29 Model staleness
#30 Insufficient model accuracy / soundness
#31 Model degradation from unexpected use
#32 Inadequate operational resilience
#33 Unmet architectural requirements
#34 Lack of reproducibility
#44 Disruption to connected systems