#32

Inadequate operational resilience

Risk taxonomy

Definition

Operational resilience and service continuity planning becomes more complex because of the broad set of services and capabilities that Gen AI involves.

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses.

Control category

Corrective · 7

Operational resilience targets defined at design

Define operational resilience requirements (RTO, RPO, availability SLA) for the AI system at design stage.

Lifecycle stage: 1 – Use Case Context & Design
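The design-stage targets above can be recorded as a small, validated structure. The following is a minimal sketch, not a prescribed schema: the class name, field names, service name, and example values are all assumptions made for illustration. It also shows the arithmetic that turns an availability SLA into a monthly downtime budget.

```python
from dataclasses import dataclass

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


@dataclass(frozen=True)
class ResilienceTargets:
    """Design-stage targets for one AI service (all names here are hypothetical)."""
    service: str
    rto_minutes: float       # recovery time objective: max tolerated time to restore service
    rpo_minutes: float       # recovery point objective: max tolerated window of lost data
    availability_sla: float  # required fraction of uptime, e.g. 0.999

    def __post_init__(self) -> None:
        # Reject targets that cannot be meaningful before they reach a design review.
        if self.rto_minutes < 0 or self.rpo_minutes < 0:
            raise ValueError("RTO and RPO must be non-negative")
        if not 0.0 < self.availability_sla <= 1.0:
            raise ValueError("availability_sla must be in (0, 1]")

    def monthly_downtime_budget_minutes(self) -> float:
        """Downtime allowed per 30-day month under the availability SLA."""
        return (1.0 - self.availability_sla) * MINUTES_PER_30_DAY_MONTH


# Example values are illustrative only.
targets = ResilienceTargets("claims-assistant", rto_minutes=60, rpo_minutes=15, availability_sla=0.999)
print(round(targets.monthly_downtime_budget_minutes(), 1))  # 43.2
```

Under these assumed values, a 99.9% SLA leaves roughly 43 minutes of downtime per month, which is less than the 60-minute RTO. A single incident that uses the full RTO would already breach the SLA, so the two targets are worth checking against each other at design time.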

Modular architecture

Design a modular AI architecture with independent failover, rollback, and degraded-mode capability.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Model Drift & Silent Degradation
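One way to picture independent failover, rollback, and degraded mode is an ordered chain of handlers, where each handler stands for a separately deployable module. This is a minimal sketch under that assumption; the function names, the simulated failures, and the static degraded-mode message are hypothetical and do not model any particular serving stack.

```python
from typing import Callable, Sequence, Tuple

Handler = Callable[[str], str]


def answer_with_fallback(prompt: str, handlers: Sequence[Tuple[str, Handler]]) -> Tuple[str, str]:
    """Try each (name, handler) in order and return (name_used, response)."""
    errors = []
    for name, handler in handlers:
        try:
            return name, handler(prompt)
        except Exception as exc:  # broad on purpose: any module failure triggers failover
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all handlers failed: " + "; ".join(errors))


def primary_model(prompt: str) -> str:
    raise TimeoutError("primary endpoint timed out")  # simulated outage


def previous_model_version(prompt: str) -> str:
    raise ConnectionError("rollback target unavailable")  # simulated outage


def degraded_mode(prompt: str) -> str:
    # Degraded mode: no generation, only a safe static response.
    return "The assistant is temporarily unavailable. Your request has been queued for a human agent."


used, reply = answer_with_fallback(
    "Summarise claim 123",
    [("primary", primary_model), ("rollback", previous_model_version), ("degraded", degraded_mode)],
)
print(used)  # degraded
```

Returning the name of the handler that answered makes it possible to log how often the service runs in rollback or degraded mode, which is a useful signal for the monitoring stage.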

AI system inclusion in BCP and DRP

Include the AI system in BCP and DRP. Define recovery procedures for AI components and test at least annually.

Lifecycle stage: 3 – Onboarding, Build & Review
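The "test at least annually" requirement can be tracked with a simple recency check per AI component. The sketch below assumes "annually" means no more than 365 days between recovery tests. The component names and dates are invented for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional

MAX_TEST_AGE = timedelta(days=365)  # assumption: "annually" read as at most 365 days between tests


def recovery_test_overdue(last_tested: Optional[date], today: date) -> bool:
    """A component is overdue if it was never tested or its last test is older than MAX_TEST_AGE."""
    if last_tested is None:
        return True
    return today - last_tested > MAX_TEST_AGE


def overdue_components(last_tests: Dict[str, Optional[date]], today: date) -> List[str]:
    """Return the names of components whose recovery procedure needs a fresh test."""
    return sorted(name for name, tested in last_tests.items() if recovery_test_overdue(tested, today))


# Hypothetical AI components and last recovery-test dates.
last_tests = {
    "model-endpoint": date(2024, 3, 1),
    "vector-store": date(2023, 1, 15),
    "prompt-registry": None,  # never tested
}
print(overdue_components(last_tests, today=date(2024, 6, 1)))  # ['prompt-registry', 'vector-store']
```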

Robustness testing

Conduct load, failover, and chaos testing before production deployment. Block go-live if RTO/RPO criteria are not met.

Lifecycle stages: 4 – Deployment; 5 – Usage, Monitoring & Change

Also addresses: Hallucination; Overreliance / Automation Bias; Model Drift & Silent Degradation
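The go-live gate described above amounts to comparing measured recovery figures from each test scenario against the agreed RTO and RPO. The following sketch illustrates that comparison. The `DrillResult` structure, scenario names, and measurements are hypothetical, and treating "no results recorded" as a blocker is an assumption rather than something the control states.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class DrillResult:
    """Outcome of one failover or chaos scenario (hypothetical structure)."""
    scenario: str
    measured_recovery_minutes: float
    measured_data_loss_minutes: float


def go_live_blockers(results: Sequence[DrillResult], rto_minutes: float, rpo_minutes: float) -> List[str]:
    """Return reasons to block go-live; an empty list means the gate passes."""
    if not results:
        # Assumption: missing evidence blocks release just like a failed drill.
        return ["no drill results recorded"]
    blockers = []
    for r in results:
        if r.measured_recovery_minutes > rto_minutes:
            blockers.append(
                f"{r.scenario}: recovery {r.measured_recovery_minutes} min exceeds RTO {rto_minutes} min"
            )
        if r.measured_data_loss_minutes > rpo_minutes:
            blockers.append(
                f"{r.scenario}: data loss {r.measured_data_loss_minutes} min exceeds RPO {rpo_minutes} min"
            )
    return blockers


results = [
    DrillResult("region failover", measured_recovery_minutes=42, measured_data_loss_minutes=5),
    DrillResult("chaos: kill inference pods", measured_recovery_minutes=75, measured_data_loss_minutes=0),
]
blockers = go_live_blockers(results, rto_minutes=60, rpo_minutes=15)
print("BLOCK" if blockers else "PASS", blockers)
```

In a deployment pipeline, a non-empty list would typically fail the release step, and the listed reasons would go into the release record.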

AI incident response runbook with severity triage and classification

Define AI incident categories, severity tiers, and triage flow before go-live. Gate launch on governance approval of the plan and named roles.

source: NIST SP 800-61r2 Computer Security Incident Handling Guide (Preparation; Detection & Analysis – incident categorisation/prioritisation); NIST AI RMF MANAGE 4.1

Lifecycle stages: 1 – Use Case Context & Design; 5 – Usage, Monitoring & Change
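A triage flow of this kind can be written down as a small, reviewable rule set. The sketch below is illustrative only: the incident categories, the three severity tiers, the escalation rule for customer-facing incidents, and the choice to send unknown categories to the top tier are all assumptions, not content from the cited NIST guidance.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # most severe
    SEV2 = 2
    SEV3 = 3


# Hypothetical baseline severity per AI incident category.
BASELINE = {
    "service_outage": Severity.SEV2,
    "harmful_output": Severity.SEV2,
    "quality_degradation": Severity.SEV3,
}


def triage(category: str, customer_facing: bool) -> Severity:
    """Map an incident to a severity tier using the baseline table and one escalation rule."""
    if category not in BASELINE:
        # Assumption: unclassified incidents go to the top tier so a human reviews them promptly.
        return Severity.SEV1
    base = BASELINE[category]
    if customer_facing and base > Severity.SEV1:
        return Severity(base - 1)  # escalate by one tier
    return base


print(triage("quality_degradation", customer_facing=False).name)  # SEV3
print(triage("service_outage", customer_facing=True).name)        # SEV1
print(triage("prompt_leak", customer_facing=False).name)          # SEV1
```

Keeping the table and rules in code or configuration lets governance approve a concrete artefact and lets later changes be reviewed as diffs.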

BCP/DRP activation and degraded-mode continuity for AI services

Set the AI service's criticality tier, RTO/RPO, and degraded-mode service level at design with business sign-off. Register it in enterprise BCP scope.

source: ISO 22301 Business Continuity Management; ISO/IEC 27031; NIST SP 800-34r1 (Activation & Notification, Recovery)

Lifecycle stages: 1 – Use Case Context & Design; 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change

Defined escalation path to a designated AI incident response team

Wire detections into the IR queue and verify paging with a test escalation before go-live. Gate release on a successful dry-run.

source: ISO/IEC 27035-1:2023 Information security incident management (incident response coordination); NIST SP 800-61r2 (Coordination & Information Sharing)

Lifecycle stages: 4 – Deployment; 5 – Usage, Monitoring & Change
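The dry-run gate can be sketched as a check that a test page reaches a named responder for every severity tier. The example below uses an in-memory stand-in for a paging tool. The `FakePager` class, the tier names, and the rota entries are hypothetical, and no real paging API is modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class FakePager:
    """In-memory stand-in for a real paging tool (hypothetical)."""
    on_call: Dict[str, str]  # severity tier -> responder
    sent: List[Tuple[str, str]] = field(default_factory=list)

    def page(self, severity: str, message: str) -> bool:
        """Deliver a page; return False if no responder is configured for the tier."""
        responder = self.on_call.get(severity)
        if responder is None:
            return False
        self.sent.append((responder, message))
        return True


def dry_run_passes(pager: FakePager, severities: Sequence[str]) -> bool:
    """The release gate passes only if a test page is delivered for every tier."""
    return all(pager.page(s, f"[TEST] escalation dry-run for {s}") for s in severities)


tiers = ["SEV1", "SEV2", "SEV3"]
pager = FakePager(on_call={"SEV1": "ai-ir-lead", "SEV2": "ai-ir-oncall"})
first_attempt = dry_run_passes(pager, tiers)   # False: nobody is on call for SEV3

pager.on_call["SEV3"] = "ai-ir-oncall"
second_attempt = dry_run_passes(pager, tiers)  # True once every tier has a responder
print(first_attempt, second_attempt)
```

A real dry-run would also confirm that the responder acknowledged the page within an agreed time, which this sketch does not model.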


Other risks in Robustness & Stability

#24 Hallucination / Fabrication / Confabulation
#25 Overconfidence
#26 Training data or inputs not fit for purpose
#27 Lack of continuous monitoring
#28 Insufficient data quality
#29 Model staleness
#30 Insufficient model accuracy / soundness
#31 Model degradation from unexpected use
#33 Unmet architectural requirements
#34 Lack of reproducibility
#44 Disruption to connected systems