๐Ÿ”AI RiskAtlas
โ† Risk Taxonomy
#32

Inadequate operational resilience

Risk taxonomy

Definition

Operational resilience or service continuity plans increase in complexity due to the broad set of services and capabilities of Gen AI.

Controls & guardrails that address this

7

Grouped by control function, each with the AI lifecycle stage(s) where it applies and the other risks it addresses.

Control category
Corrective · 7
Operational resilience targets defined at design

Define operational resilience requirements (RTO, RPO, availability SLA) for the AI system at design stage.

Lifecycle stage: 1 – Use Case Context & Design
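
As a minimal sketch of what recording these targets at design stage could look like, the snippet below captures RTO, RPO, and an availability SLA in one record and derives the downtime budget the SLA implies. The class name `ResilienceTargets`, its fields, and the example values are assumptions for illustration, not part of the control.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ResilienceTargets:
    """Design-stage resilience targets for one AI system (hypothetical schema)."""
    rto: timedelta           # Recovery Time Objective: max tolerated time to restore service
    rpo: timedelta           # Recovery Point Objective: max tolerated window of data loss
    availability_sla: float  # required fraction of uptime, e.g. 0.999

    def __post_init__(self):
        # Reject targets that cannot be meaningful before they reach a design review.
        if not 0.0 < self.availability_sla <= 1.0:
            raise ValueError("availability_sla must be in (0, 1]")
        if self.rto <= timedelta(0) or self.rpo < timedelta(0):
            raise ValueError("rto must be positive and rpo non-negative")

    def allowed_downtime(self, period: timedelta = timedelta(days=30)) -> timedelta:
        """Downtime budget implied by the availability SLA over the given period."""
        return period * (1.0 - self.availability_sla)


# Illustrative values only; real targets come from the business impact analysis.
targets = ResilienceTargets(
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=15),
    availability_sla=0.999,
)
budget = targets.allowed_downtime()  # roughly 43 minutes per 30 days at 99.9%
```

Expressing the SLA as a downtime budget makes it easier to check whether the RTO is even compatible with it: a single outage that runs to the full one-hour RTO would already exceed a 99.9% monthly budget.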
Modular architecture

Design a modular AI architecture with independent failover, rollback, and degraded-mode capability.

Lifecycle stage: 3 – Onboarding, Build & Review
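
One way to picture independent failover with a degraded mode is an ordered chain of modules, where each is tried in turn and a static fallback answers when every model path is down. The sketch below assumes hypothetical handlers (`primary_model`, `secondary_model`, `degraded_mode`) and simulates outages; it does not cover rollback, which would need versioned deployments.

```python
from typing import Callable, List, Sequence, Tuple

Handler = Callable[[str], str]


def call_with_failover(prompt: str, handlers: Sequence[Tuple[str, Handler]]) -> Tuple[str, str]:
    """Try each module in order; return (module_name, response) from the first that succeeds."""
    errors: List[str] = []
    for name, handler in handlers:
        try:
            return name, handler(prompt)
        except Exception as exc:  # a failure in one module must not take down the chain
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all modules failed: " + "; ".join(errors))


def primary_model(prompt: str) -> str:
    raise TimeoutError("primary endpoint unavailable")  # simulated outage


def secondary_model(prompt: str) -> str:
    raise ConnectionError("secondary region unreachable")  # simulated outage


def degraded_mode(prompt: str) -> str:
    # Degraded mode: no model call, just a safe static response.
    return "The assistant is temporarily limited. Your request has been queued."


used, reply = call_with_failover(
    "summarise ticket 123",
    [("primary", primary_model), ("secondary", secondary_model), ("degraded", degraded_mode)],
)
```

Returning the module name alongside the response lets monitoring count how often traffic lands on the secondary or degraded path.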
AI system inclusion in BCP and DRP

Include the AI system in BCP and DRP. Define recovery procedures for AI components and test at least annually.

Lifecycle stage: 3 – Onboarding, Build & Review
Robustness testing

Conduct load, failover, and chaos testing before production deployment. Block go-live if RTO/RPO criteria are not met.

Lifecycle stages: 4 – Deployment; 5 – Usage, Monitoring & Change
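
The go-live block can be expressed as a simple gate over measured drill results. This is a sketch under assumptions: `DrillResult`, `go_live_blockers`, the drill names, and all numbers are hypothetical, and a real pipeline would pull measurements from its test tooling rather than hard-code them.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DrillResult:
    name: str
    recovery_seconds: float   # measured time to restore service during the drill
    data_loss_seconds: float  # measured window of lost data during the drill


def go_live_blockers(results: Iterable[DrillResult], rto_seconds: float, rpo_seconds: float) -> List[str]:
    """Return the reasons to block go-live; an empty list means the gate passes."""
    blockers: List[str] = []
    for r in results:
        if r.recovery_seconds > rto_seconds:
            blockers.append(f"{r.name}: recovery took {r.recovery_seconds:.0f}s, RTO is {rto_seconds:.0f}s")
        if r.data_loss_seconds > rpo_seconds:
            blockers.append(f"{r.name}: lost {r.data_loss_seconds:.0f}s of data, RPO is {rpo_seconds:.0f}s")
    return blockers


# Illustrative drill measurements against a 1-hour RTO and 15-minute RPO.
drills = [
    DrillResult("load test at 2x peak", recovery_seconds=0, data_loss_seconds=0),
    DrillResult("primary region failover", recovery_seconds=1500, data_loss_seconds=120),
    DrillResult("chaos: vector store killed", recovery_seconds=4200, data_loss_seconds=300),
]
blockers = go_live_blockers(drills, rto_seconds=3600, rpo_seconds=900)
go_live_approved = not blockers  # any blocker stops the release
```

Here the chaos drill exceeds the RTO, so the gate would hold the release until recovery for that component is improved and retested.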
AI incident response runbook with severity triage and classification

Define AI incident categories, severity tiers, and triage flow before go-live. Gate launch on governance approval of the plan and named roles.

source: NIST SP 800-61r2 Computer Security Incident Handling Guide (Preparation; Detection & Analysis – incident categorisation/prioritisation); NIST AI RMF MANAGE 4.1
Lifecycle stages: 1 – Use Case Context & Design; 5 – Usage, Monitoring & Change
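
To make the idea of categories, severity tiers, and a triage flow concrete, here is a small sketch. The category names, the three tiers, and the thresholds in `triage` are invented for illustration; neither NIST SP 800-61r2 nor the NIST AI RMF prescribes these specific values, and each organisation would define its own.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Category(Enum):
    # Hypothetical AI incident categories
    OUTAGE = "service outage"
    DEGRADED_QUALITY = "degraded output quality"
    HARMFUL_OUTPUT = "harmful or policy-violating output"
    DATA_EXPOSURE = "data exposure"


class Severity(IntEnum):
    SEV1 = 1  # most severe: page immediately
    SEV2 = 2
    SEV3 = 3  # least severe: handle in normal working hours


@dataclass(frozen=True)
class Incident:
    category: Category
    customer_facing: bool
    users_affected: int


def triage(incident: Incident) -> Severity:
    """Map an incident to a severity tier using fixed, reviewable rules (assumed thresholds)."""
    if incident.category is Category.DATA_EXPOSURE:
        return Severity.SEV1
    if incident.category is Category.OUTAGE and incident.customer_facing:
        return Severity.SEV1
    if incident.customer_facing or incident.users_affected >= 100:
        return Severity.SEV2
    return Severity.SEV3


internal_glitch = Incident(Category.DEGRADED_QUALITY, customer_facing=False, users_affected=5)
public_outage = Incident(Category.OUTAGE, customer_facing=True, users_affected=2000)
```

Keeping the rules in one small function means the triage flow itself can be reviewed and approved by governance before launch, alongside the named roles.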
BCP/DRP activation and degraded-mode continuity for AI services

Set the AI service's criticality tier, RTO/RPO, and degraded-mode service level at design with business sign-off. Register it in enterprise BCP scope.

source: ISO 22301 Business Continuity Management; ISO/IEC 27031; NIST SP 800-34r1 (Activation & Notification, Recovery)
Lifecycle stages: 1 – Use Case Context & Design; 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change
Defined escalation path to a designated AI incident response team

Wire detections into the IR queue and verify paging with a test escalation before go-live. Gate release on a successful dry-run.

source: ISO/IEC 27035-1:2023 Information security incident management (incident response coordination); NIST SP 800-61r2 (Coordination & Information Sharing)
Lifecycle stages: 4 – Deployment; 5 – Usage, Monitoring & Change


AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning, not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan