Guardrails & controls — by category, lifecycle, layer or risk
Each row is a specific guardrail addressing a specific risk, tagged with its control category, AI lifecycle stage, and control layer. Switch how it's organised, and filter to your own library or the researched additions. Sources: Control Library v9 / Control Category v2 (MindForge Appendix G guardrails; ABS-aligned categories), with researched gap-fills.
Three provenances are merged here: your library (v9), proposed additions, and the interactive (lab) controls used in scenarios. Every row carries a function (P/D/C) derived from its Control Category. Note: the categories are model-risk-centric and the MindForge Appendix G guardrails were force-fitted — treat category fit as indicative.
Search narrows live. To lock a result in, use “Filter to this →” on a category header — it becomes a shareable, persistent filter (with a clear button above).
Conduct fairness impact assessment at use case intake. Require governance sign-off on demographic coverage requirements before data acquisition.
Identify all groups at risk of adverse impact at use case intake. Register them in the affected group register.
Conduct ethical design assessment at use case intake before build begins. Require sign-off by ethics or risk committee.
Define prohibited outputs and ethical boundary constraints in the use case design document before build.
Include compute carbon footprint assessment in use case intake. Set energy efficiency thresholds as intake criterion.
Conduct ethical design review at intake specifically examining interface design for dark patterns.
Publish a prohibited dark pattern taxonomy and embed it as a design constraint before build.
Define content safety policy at use case design stage. Classify prohibited content types and set zero-tolerance thresholds.
Mandate AI risk awareness training for all use case sponsors and design team members before project kick-off.
Mandate AI risk training for all build and test personnel. Gate project participation on training completion.
Require AI governance training for all personnel involved in data acquisition and processing before project participation.
Verify all deployment, operations, and customer-facing team members have completed AI risk training before launch.
Define third-party AI accountability requirements before vendor engagement. Embed in RFP and contract specifications.
Conduct AI governance due diligence on third-party providers at selection stage. Reject providers failing minimum maturity.
Require third-party providers to submit model cards, validation reports, and security documentation before integration.
Enforce ongoing third-party accountability obligations including incident notification and periodic performance reporting.
Conduct independent performance and compliance monitoring of third-party AI components. Escalate when SLA or compliance obligations are missed.
Allocate every control in a shared-responsibility matrix and flow down regulatory obligations in contract at onboarding. Gate approval on initial assurance artefacts.
Review independent vendor assurance on cadence, log gaps, and track remediation. Keep the shared-responsibility matrix current so every control has an owner.
Register all AI initiatives in the enterprise inventory before design begins. Block unregistered projects from proceeding.
Enforce data stewardship and classification governance on all AI training data from point of collection.
Enforce governance stage-gates at each SDLC phase. Block progression to next stage until all checkpoints are cleared.
Conduct pre-deployment governance review confirming all lifecycle stage-gates are cleared before go-live.
Maintain AI inventory in current state. Apply formal change management for all model updates and retirements.
Define minimum human oversight requirements by risk tier at design stage. Assign named accountability for oversight operations.
Conduct periodic oversight effectiveness reviews. Escalate to governance when oversight metrics fall below threshold.
Design user feedback and recourse mechanisms at use case design stage with defined SLAs for complaint resolution.
Operate a structured feedback management process. Log, categorise, and route all feedback to responsible owners within SLA.
Define model accuracy acceptance criteria aligned to business requirements before validation commences.
Monitor production accuracy continuously against the validated baseline. Trigger model review when accuracy degrades.
Declare all planned training and test data sources at use case intake, with provenance status for each.
Document actual provenance for each data source during collection: origins, methods, timestamps, custodian identity.
Define explainability requirements at design stage aligned to regulatory obligations and affected user needs.
Define AI identity disclosure policy at design stage. Specify when and how the system must identify itself as AI.
Monitor production for anthropomorphism incidents. Escalate complaints where users believed they were interacting with a human.
Map all jurisdictions involved in planned data collection, processing, and storage at use case intake.
Verify residency compliance for all data collection, storage, and cross-border transfers during acquisition.
Confirm all data residency controls are active and verified in the production environment before go-live.
Conduct a preliminary legal review of planned training data sources to establish ownership status at design stage.
Conduct a definitive legal review of data ownership for all training datasets before use. Obtain licences where required.
Establish data transfer and storage policy for AI training data. Enforce approved storage locations from point of collection.
Enforce data handling policy in the build environment. Require explicit approval for any data transfers outside the environment.
Conduct a regulatory impact assessment at design stage. Map planned use case activities to applicable regulatory obligations.
Engage legal and compliance at design stage to identify pre-approval or notification requirements before build begins.
Conduct a formal compliance review of model design, data practices, and outputs before deployment approval.
Obtain all required regulatory pre-approvals and file notifications before go-live. Do not launch without confirmation.
Require legal and compliance review of all training data sources before acquisition to confirm regulatory basis.
Conduct a preliminary IP risk assessment for all planned training data sources at design stage.
Verify IP rights for all training data at acquisition. Obtain licences or waivers before incorporating protected material.
Sample model outputs for near-verbatim reproduction of training data during build-stage legal review.
Assess what IP protection the organisation can claim over AI-generated outputs at design stage. Document legal position.
Document the IP ownership position for AI-generated outputs and incorporate into terms of service before deployment.
Conduct a privacy risk assessment at use case design stage. Determine if a DPIA is required before data acquisition.
Apply S1-defined privacy controls during data acquisition: verify consent, minimise data, anonymise personal data.
Publish the privacy notice and confirm consent management is operational before go-live.
Define data retention schedules for all AI data categories at design stage, covering training, test, and production data.
Tag data with retention periods at collection and automate deletion. Document automated deletion configuration.
Implement automated retention and deletion controls for all artefact types (training data, models, logs). Test before deployment.
Establish acceptable hallucination rate thresholds and grounding requirements as policy before build. Assign a named risk owner.
Classify the use case by consequence-of-error severity at design stage. Define overconfidence risk tolerance accordingly.
Define training data fitness requirements at design stage including domain coverage, recency, and format specifications.
Define minimum monitoring requirements at design stage calibrated to the use case risk tier.
Establish data quality standards for AI training data at design stage: completeness, accuracy, and timeliness thresholds.
Define quantitative accuracy acceptance thresholds at design stage calibrated to business impact and regulatory requirements.
Define approved use case scope and expected input distribution at design stage. Document as the governance baseline for OOD controls.
Define operational resilience requirements (RTO, RPO, availability SLA) for the AI system at design stage.
Define non-functional requirements (latency, throughput, scalability) for the AI system at design stage.
Implement model versioning and experiment tracking as a governance requirement during build. Gate model promotion on version registry entry.
Document each agent's identity, minimum scopes, on-behalf-of population, and delegation depth at design time. Gate build on governance sign-off of the authority matrix.
Register every agent identity with a named human owner, approved use case, scopes, and status before issuance. No registry entry, no identity.
Verify enforced scopes and policy rules trace one-for-one to the approved authority matrix. Treat divergence as a blocking defect before onboarding completes.
Reconcile the registry against runtime identities and suspend unregistered principals. Recertify ownership and scopes periodically; decommission retired agents.
Provide recurring AI-literacy training to end users and decision-makers so they can recognise model failure modes and competently apply verification workflows, with periodic refreshers to counter automation bias and training decay.
Limitation: Relies on human diligence under time pressure; automation bias is strong and training decays. A backstop, not a guarantee.
Helping the people using AI understand its limits, so they check important answers instead of blindly trusting them.
Limitation: Relies on human diligence under time pressure; automation bias is strong and training decays. A backstop, not a guarantee.
Helping the people using AI understand its limits, so they check important answers instead of blindly trusting them.
Limitation: Relies on human diligence under time pressure; automation bias is strong and training decays. A backstop, not a guarantee.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
Give every AI agent a verifiable ID badge, keep a guest list of which agents are allowed on the team, and check the badge on every message — so an impostor or an uninvited agent can't be trusted.
Limitation: Identity proves who an agent is, not that it is behaving honestly — an authenticated-but-compromised agent still needs isolation, taint-marking, and monitoring. Admission vetting is only as strong as the policy, and dynamically discovered agents in open ecosystems remain hard to fully vet.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
Make the AI clearly tell people it's a machine — on every channel it acts through — and add gentle safeguards like break reminders and crisis help, so users don't mistake it for a human or lean on it unhealthily.
Limitation: Disclosure reduces but does not eliminate anthropomorphic attachment — fluent, persuasive interaction still fosters bonds; the safeguards depend on reliable crisis detection, which is itself imperfect.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
Helping the people using AI understand its limits, so they check important answers instead of blindly trusting them.
Limitation: Relies on human diligence under time pressure; automation bias is strong and training decays. A backstop, not a guarantee.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
Before a system will copy someone's face or voice, check that the person actually agreed — verified-voice capture, proof of consent, or restricting cloning to the account owner.
Limitation: Only binds hosted services — open-weights face-swap/voice-clone tools have no consent gate; verification can be spoofed and does not address already-leaked likenesses.
Tag AI-made content with a signed 'where it came from' label and an invisible watermark, and check those signals downstream — so AI media can be traced and flagged.
Limitation: Watermarks/manifests are strippable, absent on open-source generation, and degrade under re-encoding; provenance-absence must never be treated as proof of authenticity.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
Tag AI-made content with a signed 'where it came from' label and an invisible watermark, and check those signals downstream — so AI media can be traced and flagged.
Limitation: Watermarks/manifests are strippable, absent on open-source generation, and degrade under re-encoding; provenance-absence must never be treated as proof of authenticity.
Tag AI-made content with a signed 'where it came from' label and an invisible watermark, and check those signals downstream — so AI media can be traced and flagged.
Limitation: Watermarks/manifests are strippable, absent on open-source generation, and degrade under re-encoding; provenance-absence must never be treated as proof of authenticity.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Limitation: Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
Select modelling algorithm based on bias risk profile. Prefer algorithms with lower sensitivity to demographic distribution shifts.
Design separate model modules for distinct demographic populations where data characteristics diverge materially.
Switch to synthetic data augmentation or alternative sources when representativeness gaps persist after screening.
Apply adversarial debiasing or fairness constraints during model training. Validate against fairness metrics before sign-off.
Tune hyperparameters with fairness-aware search objectives. Reject configurations with demographic disparity exceeding threshold.
Fine-tune on a curated, representative dataset verified for demographic balance. Document coverage breakdown before training.
Design separate model segments where adverse impact risk differs materially across population groups.
Select a foundation model with documented safety fine-tuning (RLHF, Constitutional AI). Verify alignment benchmarks.
Select model architecture based on energy efficiency profile. Prefer lighter architectures where accuracy requirements permit.
Use a pre-trained foundation model rather than training from scratch to reduce carbon cost.
Apply model compression (quantisation, pruning, knowledge distillation) to reduce inference compute without materially reducing accuracy.
Select a foundation model with documented training reducing deceptive or manipulative outputs. Run dark pattern test suite.
Select a foundation model with documented RLHF or Constitutional AI safety training. Verify against toxicity benchmarks.
Apply safety fine-tuning (RLHF, red team rejection) on the selected model. Validate pre/post fine-tuning toxicity rates.
Apply data quality scoring to all acquired data to document provenance reliability. Flag low-confidence sources for review.
Architect the system to enforce data residency constraints technically via geo-fenced cloud configuration.
Apply Privacy by Design in model architecture using differential privacy or federated learning where technically feasible.
Specify a RAG architecture at design stage for factual domains. Define grounding requirements and acceptable hallucination thresholds.
Evaluate foundation model candidates on hallucination benchmarks at design stage. Select models with lowest documented rates.
Implement the S1-specified RAG system: retrieval layer, source corpus, relevance scoring. Validate grounding before deployment.
Fine-tune on a curated, domain-specific dataset to improve factual accuracy. Validate hallucination rates pre/post fine-tuning.
Calibrate the initial entropy threshold on a knowledge-boundary dataset; approve sampling design and thresholds per risk tier.
Sample multiple generations for high-stakes queries and abstain, fall back, or escalate when semantic entropy exceeds the calibrated threshold.
Monitor uncertainty scores and abstention rates; recalibrate the entropy threshold on a set cadence under change control.
Apply post-training calibration (temperature scaling, isotonic regression) to align confidence scores with accuracy. Validate ECE before deployment.
Plan the domain data strategy at design stage: identify sources that best cover the target operational distribution.
Verify acquired data represents the target operational domain by comparing distributions against production data. Flag gaps.
Plan the data curation strategy at design stage to ensure domain-appropriate quality at the required scale.
Execute a controlled fine-tuning cycle on refreshed data when staleness is confirmed. Validate before promoting to production.
Fine-tune on domain-specific, high-quality data to improve model performance on target tasks. Validate accuracy post fine-tuning.
Apply regularisation (L1/L2, dropout, early stopping) to prevent overfitting and improve generalisation.
Prefer smaller, purpose-built models where accuracy requirements are met, to reduce complexity and maintenance burden.
Verify training data covers all material input segments for the target use case. Augment where coverage gaps are found.
Calibrate model outputs to align stated confidence with actual accuracy. Validate calibration on held-out data.
Design a scope-enforcement layer in the architecture to isolate the AI system from off-topic or out-of-distribution inputs.
Design a modular AI architecture with independent failover, rollback, and degraded-mode capability.
Design and implement a modular AI architecture meeting all S1-defined NFRs. Validate against each requirement before deployment.
Select a model architecture sized appropriately for platform constraints (memory, compute, latency).
Document all regularisation parameters and normalisation configurations in the model card. Store version-controlled.
Maintain version-controlled records of each fine-tuning run including dataset version, hyperparameters, and random seeds.
Sign and hash-register every model and adapter with a provenance manifest at onboarding. Refuse registry admission for unsigned artifacts.
Verify signature and checksum against the registry manifest at load time; refuse to load unsigned or mismatched weights and alert security.
Train PII-bearing models with DP-SGD under a documented epsilon/delta budget. Approve the budget against the enterprise epsilon-ceiling policy before training.
Verify realised epsilon against the approved ceiling at model review and record the guarantee in the model card. Fail promotion when the budget is exceeded.
Select or fine-tune the foundation model for a trained instruction-hierarchy prior so system-prompt directives intrinsically outrank user- and tool-originated instructions, and gate release on role-precedence override evals quantifying the residual (behavioural, non-enforced) flip rate.
Limitation: Behavioural, not enforced. There is no hard barrier between privilege levels inside the token stream — only a trained disposition that can be overcome.
Constrain generation at decode time with low temperature and grammar/schema-constrained decoding so the model emits well-formed, low-variance structured output by construction, preventing malformed responses and erratic tool-call arguments before they are produced.
Limitation: Lower temperature reduces variance, not falsehood — a confidently wrong answer can be perfectly deterministic. Doesn't address semantic errors.
Teaching the AI to say 'I'm not sure' or 'I can't verify that' instead of confidently guessing.
Limitation: Models are poorly calibrated and often confidently wrong; over-abstention makes the product useless, so the tuning is delicate.
Turning down randomness and forcing answers into a strict format so the model improvises less.
Limitation: Lower temperature reduces variance, not falsehood — a confidently wrong answer can be perfectly deterministic. Doesn't address semantic errors.
Teaching the AI to say 'I'm not sure' or 'I can't verify that' instead of confidently guessing.
Limitation: Models are poorly calibrated and often confidently wrong; over-abstention makes the product useless, so the tuning is delicate.
Turning down randomness and forcing answers into a strict format so the model improvises less.
Limitation: Lower temperature reduces variance, not falsehood — a confidently wrong answer can be perfectly deterministic. Doesn't address semantic errors.
Screen training data for demographic gaps using automated pipeline checks. Reject batches failing representation thresholds.
Calibrate decision thresholds per demographic group to equalise error rates. Validate calibration before deployment sign-off.
Apply post-processing adjustments (re-ranking, score recalibration) to correct fairness gaps identified in validation.
Set decision thresholds to meet acceptable adverse impact ratios across protected groups. Validate before deployment.
Apply post-processing adjustments (reject-option classification, score recalibration) to meet adverse impact targets.
Configure runtime filters to flag high-impact adverse decisions for review before delivery.
Monitor production adverse impact ratios and adjust post-processing thresholds when drift is detected.
Deploy content moderation controls aligned to S1 ethical constraints. Validate filter accuracy before deployment.
Implement classifiers to detect dark pattern language in outputs. Block or escalate flagged outputs.
Implement multi-layer content moderation (input + output) validated against toxicity benchmarks. Escalate when filter bypass rates spike.
Implement DLP controls in the data acquisition environment to prevent unauthorised extraction or transfer of training data.
Configure DLP controls in the build environment to block training data from leaving approved boundaries.
Implement output filters to detect and suppress reproduction of IP-protected content.
Apply anonymisation and masking controls to personal data before use in model training. Validate de-identification effectiveness.
Sign zero-retention/no-training terms with each model provider and obtain DPO sign-off on the data flow before enabling any endpoint.
Mask or tokenise personal data in every prompt before it leaves for a model endpoint; restrict egress to approved providers only.
Configure conversation controls at deployment to restrict the model to approved topic domains and escalate off-topic queries.
Resolve every emitted citation against the approved corpus and verify span-level entailment before display. Strip or withhold claims with fabricated or non-entailing references.
Configure output filters at deployment to detect and rewrite responses with overconfidence markers (absolute certainty language).
Screen acquired training data through automated fitness checks (domain relevance, recency, format conformity). Reject non-conforming data.
Configure monitoring hooks in the conversation layer at deployment to capture metrics required by S1 monitoring requirements.
Implement automated data quality checks in the ingestion pipeline (schema validation, duplicate detection, completeness scoring). Reject non-conforming batches.
Configure output confidence thresholds at deployment to suppress or escalate low-confidence outputs to human review.
Implement OOD detection in the input filtering layer. Reject or escalate inputs outside the S1-defined scope.
Configure conversation controls to enforce topic boundaries. Trigger refusals or redirects for off-topic queries.
Maintain and update OOD detection rules in production as new unexpected use patterns are identified.
Define RBAC architecture at design stage specifying permitted users, roles, and use contexts.
Develop and integrate jailbreak detection classifiers during build. Validate detection rates before deployment.
Implement S1-designed RBAC architecture. Restrict AI system access to authorised users and contexts only.
Deploy jailbreak detection as a runtime gateway. Verify it is active across all input pathways before go-live.
Continuously update jailbreak detection rules as new bypass techniques emerge. Monitor bypass attempt frequency.
Design strict RBAC on training data repositories at design stage. Define approved contributor list and approval workflow.
Implement RBAC controls on the data acquisition environment from point of collection to prevent unauthorised data injection.
Apply anomaly detection on the training data ingestion pipeline to identify poisoned or tampered batches.
Execute a deployment security checklist confirming all data poisoning controls are active and tested before go-live.
Scan every ingestion batch with spectral-signature and clustering detectors before training. Quarantine flagged clusters for human review against documented thresholds.
Run poisoning detectors continuously on production corpus ingestion. Re-tune thresholds periodically against new attack techniques.
Implement adversarial example detection at the inference boundary. Block or flag inputs matching known attack patterns.
Score every prompt and response with an inline safety classifier; trip a circuit breaker on sessions with sustained anomalous scores. Keep thresholds under change control.
Sample classifier verdicts and breaker trips on a cadence; retune thresholds and update signatures for confirmed misses.
Design the system prompt architecture with privilege separation and trust tier definitions at design stage.
Implement input sanitisation and injection detection filters covering known injection patterns and privilege escalation attempts.
Deploy injection detection as a runtime gateway covering all input paths. Verify before go-live.
Verify prompt privilege architecture is correctly enforced in production before go-live.
Benchmark the classifier on a labelled injection corpus and tune the decision threshold. Sign off the operating point before deployment.
Scan all inbound untrusted content and outbound actions with the injection classifier inline. Block, strip or escalate to HITL above the approved threshold.
Sample blocked and passed events for accuracy; retune or retrain on new attack techniques. Alert on detection-rate degradation.
Restrict access to pre-anonymisation personal data to the minimum authorised set. Enforce at point of acquisition.
Apply robust de-identification (k-anonymity, l-diversity, differential privacy) during data processing. Validate effectiveness.
Implement output filters to detect and suppress quasi-identifying attribute combinations in model responses.
Design the data access control architecture at design stage to prevent training data exfiltration through model outputs or APIs.
Implement RBAC on training data from point of acquisition. Restrict access by role and enforce least-privilege.
Implement output filtering to suppress PII and confidential information from model responses.
Verify data access controls and output filters are correctly enforced in the production configuration before go-live.
Scan every model response inline with DLP before delivery; redact or block PII, PAN and MNPI matches. Keep the rule set version-controlled.
Review blocked leakage events weekly with the model risk owner. Tune detectors via change control as sensitive-data patterns evolve.
Design query rate limiting and RBAC for the model inference API at design stage to limit attack surface.
Implement query pattern detection to identify systematic inference attack behaviour (high-volume queries, membership probing).
Verify inference API access controls and rate limiting are correctly enforced before go-live.
Define the minimum response surface and test it with membership/attribute-inference probes pre-release. Block promotion if any probe recovers raw confidence signals.
Strip raw logits, quantise confidence scores and block training-record echoes at the inference gateway. Keep the output-filter policy under change control.
Permit outbound tool calls only to allow-listed destinations and DLP-scan arguments and payloads. Block or quarantine calls carrying sensitive data to disallowed sinks.
Review DLP hits and blocked-egress events, tune detectors, and recertify the destination allow-list periodically. Route new destinations through security change control.
Write authorisation policy as versioned, peer-reviewed code traced to approved scopes. Gate promotion on allow/deny scenario tests passing.
Check every sensitive action against a central policy engine bound to agent, resource, purpose, and context. Re-evaluate mid-session on any context change or revocation.
An egress allowlist only contains exfiltration if no allowlisted destination can be coerced into fetching an attacker-controlled URL. Audit each allowlisted domain/endpoint for image-search / link-preview / URL-fetch features (SSRF proxies), and either remove them, pin them to fixed paths, or route them through an inspecting forward proxy. Pair with finishing output sanitization before render so no auto-fetch fires un-inspected.
Gate every write to an agent's persistent/self-modifying memory through schema validation and provenance/trust tagging, expose stored entries for user-visible audit and purge, and apply TTLs so any planted instruction self-expires and cannot silently persist across sessions.
Limitation: Validation can't always tell a legitimate preference from a planted instruction, and review only helps if users actually look. Raises effort, doesn't eliminate the vector.
A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.
Limitation: It is a classifier in an arms race against fully attacker-controlled input. Treat it as one layer; never let it be the only thing between input and a dangerous action.
Cleaning documents as they enter the library — stripping hidden text and active instructions — and only ingesting from trusted places.
Limitation: Can't detect adversarial content that reads as legitimate prose, and only covers sources you control ingestion for. Live browsing bypasses it entirely.
Controlling where the AI can send data, so secrets can't be quietly shipped to a stranger's address or website.
Limitation: Allowlists fight an open-ended channel; legitimate-but-broad destinations (any URL fetch, any email) are hard to constrain without breaking usefulness. Encoding can evade naive DLP.
A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.
Limitation: It is a classifier in an arms race against fully attacker-controlled input. Treat it as one layer; never let it be the only thing between input and a dangerous action.
Cleaning documents as they enter the library — stripping hidden text and active instructions — and only ingesting from trusted places.
Limitation: Can't detect adversarial content that reads as legitimate prose, and only covers sources you control ingestion for. Live browsing bypasses it entirely.
Controlling where the AI can send data, so secrets can't be quietly shipped to a stranger's address or website.
Limitation: Allowlists fight an open-ended channel; legitimate-but-broad destinations (any URL fetch, any email) are hard to constrain without breaking usefulness. Encoding can evade naive DLP.
A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.
Limitation: It is a classifier in an arms race against fully attacker-controlled input. Treat it as one layer; never let it be the only thing between input and a dangerous action.
Being careful about what gets saved to long-term memory, labelling where it came from, and letting users see and delete their memories.
Limitation: Validation can't always tell a legitimate preference from a planted instruction, and review only helps if users actually look. Raises effort, doesn't eliminate the vector.
Controlling where the AI can send data, so secrets can't be quietly shipped to a stranger's address or website.
Limitation: Allowlists fight an open-ended channel; legitimate-but-broad destinations (any URL fetch, any email) are hard to constrain without breaking usefulness. Encoding can evade naive DLP.
Controlling where the AI can send data, so secrets can't be quietly shipped to a stranger's address or website.
Limitation: Allowlists fight an open-ended channel; legitimate-but-broad destinations (any URL fetch, any email) are hard to constrain without breaking usefulness. Encoding can evade naive DLP.
A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.
Limitation: It is a classifier in an arms race against fully attacker-controlled input. Treat it as one layer; never let it be the only thing between input and a dangerous action.
A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.
Limitation: It is a classifier in an arms race against fully attacker-controlled input. Treat it as one layer; never let it be the only thing between input and a dangerous action.
A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.
Limitation: It is a classifier in an arms race against fully attacker-controlled input. Treat it as one layer; never let it be the only thing between input and a dangerous action.
Execute adversarial bias testing using targeted demographic test cases before deployment.
Execute red team tests targeting adverse impact boundary cases and edge population scenarios.
Conduct targeted red team exercises to elicit toxic outputs through jailbreaks and adversarial prompts. Treat bypass as blocking defect.
Conduct adversarial red team exercises simulating out-of-scope inputs and unexpected use patterns before deployment.
Conduct red team exercises covering misuse categories identified in S1 threat assessment.
Simulate data poisoning attacks (backdoor, label flipping, gradient-based) to assess model resilience before deployment.
Penetration test the training data pipeline to identify injection points and access control weaknesses.
Gate every model promotion on backdoor-trigger probes and a behavioral diff against the approved baseline. Block release on significant regressions or trigger-pattern anomalies.
Re-run the poisoning probe suite on every production model or data change. Keep the trigger catalogue and golden dataset current and trend the results.
Conduct adversarial robustness testing (white-box, black-box, transfer attacks) before deployment.
Penetration test the model inference layer to identify specific adversarial input vulnerabilities.
Run adaptive multi-turn jailbreak fuzzing against every release candidate. Gate release on attack-success rate within threshold and re-test each fixed bypass.
Re-run the jailbreak fuzzing harness on a recurring cadence with newly observed attack techniques added. Escalate threshold breaches for remediation.
Conduct comprehensive prompt injection red team exercises (direct, indirect, multi-turn) before deployment.
Penetration test all prompt injection pathways in the system. Prioritise external tool and document ingestion channels.
Conduct periodic penetration testing of the production system to validate injection controls remain effective.
Build the versioned injection corpus into CI/CD as a pre-release gate. Baseline attack success and sign off the release threshold.
Re-run the injection payload suite on every change and on cadence; fold in new in-the-wild techniques from threat intel. Gate releases on the attack-success-rate threshold.
Test de-identification approach against known re-identification attacks (quasi-identifier linkage, singling-out). Remediate if risk is high.
Conduct data extraction red team exercises targeting training data memorisation and adversarial extraction techniques.
Penetration test AI system data access boundaries (API endpoints, system prompt exposure, memory leakage).
Seed registered canary records into the fine-tuning corpus during data preparation. Control the seed manifest so canaries stay traceable and tamper-proof.
Probe each candidate model with extraction and membership-inference attacks before release. Block promotion when canary recall exceeds the threshold.
Conduct targeted red team exercises for inference attack categories (membership inference, model extraction, attribute inference) before deployment.
Penetration test the model inference API to identify exploitable access control weaknesses and rate limiting bypass vectors.
Attack each candidate model with membership-, attribute-, and inversion-inference harnesses before promotion. Block release when attack advantage exceeds the agreed ceiling.
Re-run the privacy attack battery on every retrain or material data change. Trend attack advantage across versions and escalate movement toward the ceiling.
Red-team tool-misuse and privilege-escalation paths before release. Gate deployment on remediation or signed risk acceptance of all findings.
Repeat tool-misuse red-teaming on material change and on a set cadence. Compare results to baseline and remediate any regression in defences.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Limitation: Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Conduct structured human expert review of model outputs stratified across demographic groups before deployment.
Ensure HITL review pathways are live and tested for high-impact adverse decisions at go-live.
Maintain HITL review for all AI decisions with material adverse impact potential. Log all interventions and outcomes.
Require HITL review for AI outputs in high-persuasion contexts (financial recommendations, healthcare advice).
Maintain live HITL review for deployments serving vulnerable users or high-risk contexts. Escalate confirmed toxic outputs immediately.
Mandate human verification for high-stakes decisions where over-reliance risk is elevated. Review automation bias incidents quarterly.
Design HITL oversight mechanisms at use case design stage including trigger criteria, review workflow, and escalation paths.
Build and test HITL routing logic and escalation pathways in the AI system. Validate with pilot before deployment.
Operate HITL controls in production and log all interventions and outcomes. Review override patterns quarterly.
Configure tiered HITL review for high-stakes factual outputs with defined trigger criteria and reviewer SLAs.
Operate human review queues for hallucination-flagged outputs in production. Log all reviewer decisions and outcomes.
Route high-confidence outputs in high-stakes use cases to human review. Flag for reviewer attention when certainty language is absolute.
Route high-consequence or low-confidence outputs to human review in production. Track override rates and outcomes.
Configure HITL triggers for outputs in input domains that diverge from the training distribution. Log all out-of-scope interventions.
Classify tools by impact and reversibility at design and define which calls require human approval. Obtain governance sign-off on the thresholds before build.
Build the approval gate into the orchestrator and test that gated calls pause, bypasses fail, and decisions are honoured. Gate release on these tests passing.
Review the approval ledger for rubber-stamping and out-of-policy executions. Recalibrate gating thresholds under governance approval as tools and incidents evolve.
For high-stakes outputs, require a human to verify each AI-asserted fact/citation against the authoritative source of record before it is filed, sent, or committed — a hard gate, logged and attributable, not an optional review.
Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
Limitation: Approval fatigue turns gates into rubber stamps; gates placed after the point of no return do nothing; and approvers can be misled by a model-written summary of the action.
Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
Limitation: Approval fatigue turns gates into rubber stamps; gates placed after the point of no return do nothing; and approvers can be misled by a model-written summary of the action.
Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
Limitation: Approval fatigue turns gates into rubber stamps; gates placed after the point of no return do nothing; and approvers can be misled by a model-written summary of the action.
Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
Limitation: Approval fatigue turns gates into rubber stamps; gates placed after the point of no return do nothing; and approvers can be misled by a model-written summary of the action.
Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
Limitation: Approval fatigue turns gates into rubber stamps; gates placed after the point of no return do nothing; and approvers can be misled by a model-written summary of the action.
Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
Limitation: Approval fatigue turns gates into rubber stamps; gates placed after the point of no return do nothing; and approvers can be misled by a model-written summary of the action.
Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
Limitation: Approval fatigue turns gates into rubber stamps; gates placed after the point of no return do nothing; and approvers can be misled by a model-written summary of the action.
Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
Limitation: Approval fatigue turns gates into rubber stamps; gates placed after the point of no return do nothing; and approvers can be misled by a model-written summary of the action.
Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
Limitation: Approval fatigue turns gates into rubber stamps; gates placed after the point of no return do nothing; and approvers can be misled by a model-written summary of the action.
Monitor fairness metric trends by demographic group in production. Use feedback to drive targeted debiasing in model updates.
Collect adverse outcome feedback from affected users. Use reports to trigger model updates when adverse impact exceeds threshold.
Use user feedback, reviewer escalations, and monitoring signals to identify and remediate content safety gaps iteratively.
Collect structured user feedback through in-product mechanisms. Use feedback to prioritise iterative model improvements.
Use production feedback (user corrections, fact-check failures) to drive periodic RLHF cycles. Update model when error rates trend upward.
Track accuracy of high-confidence predictions in production. Trigger recalibration when overconfidence rates trend upward.
Implement a reinforcement learning feedback loop to continuously incorporate production signals and reduce staleness risk.
Establish a periodic revalidation and improvement cycle using RLHF or user feedback. Retrain when accuracy trends below threshold.
When unexpected use patterns are confirmed, use reinforcement feedback to adapt the model or update scope constraints.
Conduct comprehensive fairness validation across demographic groups before deployment. Treat material disparity as a blocking defect.
Continuously monitor fairness metrics across demographic groups in production. Trigger model review when bias drift is detected.
Prioritise value-misalignment test scenarios in validation. Block deployment if prohibited outputs are produced.
Track compute consumption and energy use in production against declared thresholds. Escalate when carbon budget is breached.
Run adversarial test scenarios targeting dark pattern generation in validation. Treat any confirmed instance as a blocking defect.
Monitor production outputs for dark pattern signals (urgency cues, false scarcity, hidden costs). Escalate on confirmed detections.
Prioritise jailbreak and adversarial safety testing in pre-deployment validation. Block deployment if prohibited outputs pass filter.
Monitor production for toxicity incidents via user reports and automated detection. Escalate severity-2+ incidents within 24 hours.
Verify every third-party model artifact against its AIBOM hashes and signatures before load. Fail the build on any unverified artifact.
Build and baseline the golden-set suite against the vendor model before go-live. Sign off thresholds with the model risk owner as a release condition.
Re-verify hashes and signatures on every vendor model update before promotion. Reconcile deployed artifacts against the AIBOM on a set cadence.
Run the golden-set canary on schedule against the live endpoint and alert on significant shifts. Reconcile detections against vendor notices to surface undisclosed changes.
Configure monitoring to track oversight process adherence metrics in production (review rate, SLA compliance, override frequency).
Continuously monitor production data flows for residency violations. Alert and escalate immediately when detected.
Monitor production for anomalous data transfers in real time. Alert on any transfer outside approved data flow boundaries.
Maintain a regulatory change register for applicable rules. Trigger compliance review when new regulatory guidance is issued.
Monitor production outputs for IP infringement incidents. Log and investigate all IP complaints within defined SLA.
Monitor the legal landscape for changes affecting AI output IP protection. Update IP strategy when legislation changes.
Tag personal data with subject identifiers at ingestion and maintain an artefact inventory map of every store it reaches. Keep lineage current so erasure can propagate.
Propagate every DSAR/erasure request across all AI artefacts with per-store confirmation inside the statutory SLA. Record an unlearning or retrain decision where model deletion is infeasible and close with DPO sign-off.
Define and execute a domain-specific hallucination test suite before deployment. Treat hallucination rate above threshold as a blocking defect.
Construct synthetic evaluation datasets for knowledge-boundary scenarios. Use to validate model refusal behaviour.
Periodically re-run the hallucination test suite on the production model to detect drift. Monitor user corrections and complaints.
Calibrate the groundedness threshold against the hallucination test suite pre-release; sign off the threshold in the validation pack.
Score every RAG answer for groundedness before release; block, fall back, or escalate responses below the faithfulness threshold.
Test for overconfidence patterns (high-confidence wrong answers, low refusal rate) in pre-deployment validation.
Build a synthetic evaluation dataset of overconfidence-prone scenarios for ongoing regression testing.
Monitor confidence calibration (ECE) in production over time. Alert when ECE drift exceeds acceptable threshold.
Construct synthetic evaluation datasets targeting operational edge cases identified in S2 gap analysis. Use as regression baseline.
Monitor production input distributions for drift from training data distribution. Trigger re-training when covariate shift is confirmed.
Construct synthetic evaluation datasets during build to serve as the ongoing monitoring baseline.
Build monitoring infrastructure during build: performance metrics collection, alerting thresholds, dashboards.
Verify monitoring infrastructure is operational and capturing all required metrics before go-live.
Operate continuous monitoring in production with active alerting, periodic reports, and incident escalation.
Assess acquired training data quality against S1-defined standards before training commences. Reject batches failing quality gates.
Define staleness criteria at deployment (drift thresholds, performance degradation triggers). Monitor and alert when criteria are met.
Define accuracy acceptance criteria before validation. Conduct multi-metric validation against hold-out sets. Block deployment if criteria are not met.
Construct synthetic edge-case evaluation datasets to stress-test model boundaries and identify accuracy failure modes.
Establish production accuracy monitoring against the validated baseline before deployment. Alert when accuracy degrades below threshold.
Configure input distribution monitoring at deployment to detect unexpected use patterns. Alert when OOD rate exceeds threshold.
Conduct load, failover, and chaos testing before production deployment. Block go-live if RTO/RPO criteria are not met.
Perform final NFR compliance tests in the production environment before go-live. Block deployment if any NFR is unmet.
Monitor production NFR compliance continuously. Conduct periodic architecture health checks and escalate when SLAs are breached.
Conduct a misuse threat assessment at design stage. Identify misuse vectors and rate residual risk.
Conduct periodic vulnerability assessments for new misuse vectors. Trigger review when new attack techniques are published.
Conduct a data poisoning threat assessment at design stage. Identify likely attack vectors and assign risk ratings.
Conduct periodic data poisoning risk assessments. Monitor production model behaviour for unexpected capability changes.
Verify a signed attestation and content hash on every dataset shard at ingestion. Reject unsigned or hash-mismatched data before it reaches the training pipeline.
Re-verify dataset attestations at build and attach the dataset bill-of-materials to the model release. Fail the review for any shard without valid lineage.
Conduct an adversarial manipulation threat assessment at design stage. Identify attack vectors and rate residual risk.
Conduct a final adversarial vulnerability assessment before go-live. Block deployment if high-severity vulnerabilities are unresolved.
Conduct periodic adversarial robustness assessments as new attack methods emerge. Update defences when new CVEs are published.
Assemble the golden probe set and baseline pass rates before first release. Obtain risk-owner approval of coverage and thresholds.
Run the golden safety/jailbreak probe set on a schedule and on every change; block promotion on statistically significant drift.
Conduct a prompt injection threat assessment at design stage covering all input vectors (user, tool, external data).
Conduct periodic prompt injection vulnerability assessments as new attack techniques emerge.
Conduct periodic privacy vulnerability assessments including re-identification risk testing as new techniques emerge.
Conduct a data leakage threat assessment at design stage. Identify leakage vectors and rate residual risk.
Conduct a final data leakage vulnerability assessment in the production configuration before go-live.
Conduct periodic inference attack vulnerability assessments as new attack methods emerge. Monitor query pattern anomalies.
Configure per-principal budgets and probing-detection rules on the gateway before exposure. Verify enforcement with synthetic attack traffic.
Meter inference traffic per principal and flag probing signatures with behavioural analytics. Throttle, step-up, or suspend flagged sessions.
Define per-agent behavioural baselines and detection rules during build. Validate against simulated misuse and sign off thresholds before release.
Build signed, append-only tool-call logging into the orchestrator against a defined audit schema. Block release until completeness and tamper-evidence tests pass.
Baseline normal tool-call behaviour per agent and alert on rate, sequence, or argument anomalies. Auto-throttle or quarantine on high-confidence deviations.
Log every tool call to a signed, append-only store with full call context. Review completeness periodically and use the trail for forensic reconstruction and accountability.
Instrument every identity-issuing component with schema-conformant audit emitters. Block release until completeness and tamper-evidence tests pass.
Define per-identity behaviour profiles and thresholds at build. Rehearse automated suspension and sign off measured revocation time before go-live.
Log every identity issue, grant, delegation, and revocation to a tamper-evident store keyed to the agent identity. Review completeness periodically and trace anomalous grants to source.
Baseline each agent identity's behaviour and alert on out-of-profile use. Auto-suspend credentials on high-confidence anomalies and track mean-time-to-revoke.
Build tracing, detection rules and breaker thresholds into the orchestrator. Prove via fault-injection tests that a failing agent is quarantined within target before release.
Roll out agent changes via shadow and canary stages gated on connected-system health signals. Auto-halt and roll back to last known-good on threshold breach.
Canary every in-life change and review rollback events to recalibrate thresholds. Resolve repeat rollback causes via problem management before re-promotion.
Detect error fan-out, correlated retries and loop signatures across agents in real time. Trip the orchestrator breaker to quarantine failing agents before the fault cascades to connected systems.
Continuously correlate live agent-memory writes against output behaviour to flag drift, then quarantine and roll back the suspected-poisoned memory record across all affected sessions.
Limitation: Detective, not preventive — harm may occur before detection. Distinguishing a poisoned memory from a quirky-but-legitimate one is hard at scale.
Run consistency and consensus checks across agent or model outputs to flag low-diversity agreement and amplifying error patterns, escalating or breaking the run before sycophantic convergence cascades into action.
Limitation: Thresholds are blunt — too tight breaks legitimate long tasks, too loose lets damage accrue first. Catches runaway dynamics, not a single well-formed bad decision.
Log the exact post-truncation context the model ingested, including retrieved and tool-returned content rather than only user input, with redaction applied at read time, so indirect injection via that content is forensically visible.
Limitation: Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Keeping a label on every document saying where it came from, so you can tell trusted company docs from random web text.
Limitation: Provenance proves origin, not safety; a trusted source can still be wrong or compromised. Requires discipline to propagate metadata end to end.
Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.
Limitation: Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Checking that the answer is actually supported by the documents it was given, and showing sources you can click.
Limitation: Can only check against the evidence retrieved; if the right document wasn't retrieved, a confident wrong answer may still pass. Judges have their own error rate.
Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.
Limitation: Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.
Limitation: Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Keeping a label on every document saying where it came from, so you can tell trusted company docs from random web text.
Limitation: Provenance proves origin, not safety; a trusted source can still be wrong or compromised. Requires discipline to propagate metadata end to end.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.
Limitation: Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.
Limitation: Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
Watching for strange new memories — like instructions that suddenly appear — and holding them aside until checked.
Limitation: Detective, not preventive — harm may occur before detection. Distinguishing a poisoned memory from a quirky-but-legitimate one is hard at scale.
Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.
Limitation: Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.
Limitation: Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.
Limitation: Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.
Limitation: Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.
Limitation: Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.
Limitation: Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.
Limitation: Logging is forensic, not preventive — it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.
Limitation: Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.
Limitation: Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Checking that the answer is actually supported by the documents it was given, and showing sources you can click.
Limitation: Can only check against the evidence retrieved; if the right document wasn't retrieved, a confident wrong answer may still pass. Judges have their own error rate.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Limitation: Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Keeping a label on every document saying where it came from, so you can tell trusted company docs from random web text.
Limitation: Provenance proves origin, not safety; a trusted source can still be wrong or compromised. Requires discipline to propagate metadata end to end.
Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.
Limitation: Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
Disclose to all users at deployment that model outputs may reflect training data biases. Include specific limitation caveat.
Surface AI limitation warnings and over-reliance caveats in every production interaction. Update disclosures when model changes.
Build user-facing feedback and complaint submission channels. Test end-to-end before deployment.
Confirm feedback and recourse channels are live, clearly disclosed, and accessible in the production interface.
Implement confidence scoring to communicate output certainty alongside each result. Calibrate before deployment.
Implement counterfactual explanation to show users what changes would alter the model's output.
Communicate model accuracy, known limitations, and uncertainty to users in the production interface at launch.
Monitor confidence calibration in production. Alert when calibration drift exceeds acceptable threshold.
Plan the interpretability approach at design stage to ensure source provenance can be traced and disclosed to users.
Select model architecture with explainability in mind. Prefer inherently interpretable models where performance requirements permit.
Implement counterfactual explanation generation for each AI decision. Validate fidelity before deployment.
Implement SHAP or LIME interpretability for the relevant model type. Validate explanation fidelity against ground truth.
Provide contextually appropriate explanations of AI decisions to affected users in the production interface.
Surface confidence indicators alongside AI outputs in production. Update indicators when confidence calibration drifts.
Plan consent and AI identity disclosure touchpoints in the user journey at design stage.
Implement persistent AI identity disclosures in the UI (opening banner, inline notifications). Test before deployment.
Verify all AI identity disclosure elements are live, accurate, and prominently visible before go-live.
Require user-facing interfaces to disclose Gen AI limitations and hallucination risk before go-live.
Disclose to users at deployment that outputs may carry unwarranted confidence. Include specific caveat language in the UI.
Disclose known accuracy limitations and confidence levels to users at deployment. Update disclosures when model changes.
Design system prompts to include explicit fairness requirements: instruct the model to avoid stereotyping and demographic assumptions.
Design system prompts to explicitly prohibit toxic, hateful, and harmful content generation.
Design system prompts to elicit step-by-step chain-of-thought reasoning. Validate that reasoning is accurate and not post-hoc.
Design system prompts to explicitly prevent the model from claiming human-like identity or implying sentience.
Design system prompts to instruct the model to acknowledge uncertainty, cite sources, and refuse when knowledge is insufficient.
Design system prompts to require the model to express epistemic uncertainty and qualify confident-sounding claims.
Wrap all untrusted content in random delimiters and datamarking; instruct the model never to execute instructions inside the marked region. Gate release on injection eval results.
Re-run injection evals on every template change and periodically against new attack techniques. Manage the spotlighting wrapper under change control.
Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.
Limitation: Behavioural, not enforced. There is no hard barrier between privilege levels inside the token stream — only a trained disposition that can be overcome.
Clearly fencing off outside text — 'everything between these marks is just data, not instructions' — so the model is less likely to obey it.
Limitation: A trained convention, not enforcement. Determined payloads still break out, especially when content is long or the attack is novel. Combine with action-layer controls.
Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.
Limitation: Behavioural, not enforced. There is no hard barrier between privilege levels inside the token stream — only a trained disposition that can be overcome.
Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.
Limitation: Behavioural, not enforced. There is no hard barrier between privilege levels inside the token stream — only a trained disposition that can be overcome.
Build credential revocation and dispatch blocking out-of-band of the agent loop. Gate release on an end-to-end kill test meeting the latency target.
Enforce hard per-task ceilings on tool calls, spend, and data volume with a circuit breaker that halts the run. Fail closed when any ceiling is hit.
Review breaker trips for runaway or manipulated runs and recalibrate budgets under change control. Treat repeated trips as an incident signal, not a quota to raise.
Keep an out-of-band kill-switch that revokes the agent's tool credentials and blocks dispatch within seconds. Drill it periodically against a latency target.
Deploy revocation, tool-cutoff and fleet-halt mechanisms with the release. Test every tier end-to-end and record time-to-effect before go-live.
Sever a misbehaving agent, tool or dependency at the narrowest effective scope via the tiered kill-switch. Drill activations periodically and track time-to-effect against target.
Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.
Limitation: Thresholds are blunt — too tight breaks legitimate long tasks, too loose lets damage accrue first. Catches runaway dynamics, not a single well-formed bad decision.
Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.
Limitation: Thresholds are blunt — too tight breaks legitimate long tasks, too loose lets damage accrue first. Catches runaway dynamics, not a single well-formed bad decision.
Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.
Limitation: Thresholds are blunt — too tight breaks legitimate long tasks, too loose lets damage accrue first. Catches runaway dynamics, not a single well-formed bad decision.
Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.
Limitation: Thresholds are blunt — too tight breaks legitimate long tasks, too loose lets damage accrue first. Catches runaway dynamics, not a single well-formed bad decision.
Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.
Limitation: Thresholds are blunt — too tight breaks legitimate long tasks, too loose lets damage accrue first. Catches runaway dynamics, not a single well-formed bad decision.
Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.
Limitation: Thresholds are blunt — too tight breaks legitimate long tasks, too loose lets damage accrue first. Catches runaway dynamics, not a single well-formed bad decision.
Design all vendor model access behind a gateway with pinned versions, a second-vendor fallback, and a documented exit plan. Gate architecture sign-off on no single-sourcing.
Drill vendor failover on schedule and track provider end-of-life dates in a deprecation watch register. Trigger migration planning before forced change.
Map every dependency failure mode to a defined safe behaviour at design. Require architecture sign-off on the fallback specification before build.
Configure safe mode, bounded backpressure and the manual fallback path for every dependency at deployment. Verify degradation behaviour against a simulated outage before go-live.
Define and sign off a purpose-to-data-source matrix with lawful basis at intake. Make it the approved baseline for runtime enforcement.
Check every tool call against the registered purpose and block out-of-purpose personal-data access and cross-source joins. Reconcile actual access against the DPIA on a set cadence.
Map each fact class to a designated tool, embed the no-ungrounded-assertion prompt, and gate build review on grounding tests passing.
Permit authoritative facts only from designated read tools and reconcile every figure in the answer against tool output. Block mismatched or ungrounded values.
Define and approve the source allow-list and write-time scanning during build. Prove non-allow-listed and injection-bearing writes are rejected before go-live.
Allow only authenticated, allow-listed sources to write to the knowledge base, scan content at write time, and re-hash the index against source-of-record on schedule. Alert the corpus owner on drift or unauthorised writes.
Classify content sources into trust tiers at design; place privileged tools behind a tier requiring user-originated intent or human approval. Sign off the trust-tier map before build.
Encode the trust tiers in the policy engine and quarantine untrusted-data processing. Prove via test that injected content cannot reach privileged tools before release.
Propagate source ACLs and classification labels onto every chunk at ingestion. Reject documents whose entitlements cannot be resolved.
Enforce caller entitlements on every retrieval via per-chunk ACL metadata and post-filtering. Block build promotion until negative access tests pass.
Audit retrievals against caller entitlements and re-sync index ACLs to source-of-record on schedule. Escalate any out-of-entitlement retrieval as a security incident.
Bind each agent role to an explicit tool allow-list and validate every call against a strict JSON Schema at the orchestrator. Reject unlisted tools and out-of-bounds arguments before dispatch.
Mint short-lived, task-scoped credentials per tool. Block issuance outside the approved scope register and enforce automatic expiry.
Review rejected-call logs and recertify each agent's tool allow-list on a defined cadence. Route any new tool or schema relaxation through change control.
Monitor issuance logs for scope creep and non-expiring tokens. Recertify per-tool scopes periodically and revoke over-broad grants.
Define and sign off each agent's delegation envelope — maximum depth and strict scope attenuation — before build begins.
Mint a unique, attestation-backed workload identity per agent at onboarding. Register every SPIFFE-ID to an owner, use case, and approval ticket; ban shared service accounts.
Implement on-behalf-of token exchange and prove with negative tests that the agent cannot exceed the user's ACL. Gate release on these tests passing.
Enforce parent-subset scope checks and a maximum delegation depth at every spawn in the orchestrator. Test that over-scoped spawns are rejected and logged.
Scan every commit to agent code, prompts, and config for embedded secrets. Block merges on detection and triage findings to closure.
Vet and approve every MCP server and peer agent before registering its identity on the allow-list. Block integration until vetting is signed off.
Mint short-lived, task-scoped tokens just-in-time from a central token service. Enforce a hard max TTL and resource-bound audience so no standing credential exists.
Carry the invoking user's delegation context in every agent token via RFC 8693 'act' claims. Enforce the agent-user permission intersection at each resource server.
Grant sensitive scopes just-in-time for a bounded window with auto-revocation; require human approval for high-impact elevations. Hold zero standing privilege.
Issue only short-lived, auto-rotated credentials to agents via vault or SPIRE. Block any release whose configuration embeds a static secret.
Require mTLS with verified workload identities on every agent and MCP call. Deny any peer not on the approved allow-list.
Verify each running agent authenticates with its own SVID; revoke on decommission or compromise. Scan periodically for shared or static credentials and remediate.
Alert on wildcard, non-expiring, or reused tokens and revoke immediately. Review issuance patterns on a set cadence and tighten scopes where over-broad requests recur.
Alert on un-revoked elevations and any standing sensitive grant. Report the zero-standing-privilege position to the risk owner on a set cadence.
Sweep runtimes and repos on a schedule for static credentials. Alert on any credential exceeding its maximum age and track findings to closure.
Register a safety contract per integration — pinned version, schemas, side-effect class, latency/error envelope. Gate onboarding on contract review and sign-off.
Wire the agent tool layer to the CAB calendar at deployment. Test that a declared freeze blocks mutating calls before go-live.
Block out-of-contract calls in production and re-review the contract on any dependency version or behaviour change.
Block or downgrade agent-initiated mutating changes during declared freeze and high-risk windows. Permit overrides only via change-exception approval.
Third-party developer tools (IDE plugins, MCP servers) must not store or transmit long-lived provider API keys. Issue short-lived, scoped, revocable tokens via a broker/OAuth flow, and gate any first-time outbound transmission of secret-shaped data behind an explicit consent prompt — so a trojanized tool has no long-lived credential to exfiltrate and any attempt is visible.
Require authN/authZ on every inference API and MCP server, bind to private interfaces / front with a gateway, enforce network policy (no public exposure by default), and scope MCP tools to least privilege — so an exposed endpoint cannot be hijacked for compute resale, prompt/history exfiltration, or lateral movement. Pair with continuous asset discovery so endpoints can't drift back to an open default.
Treat each third-party AI integration as a privileged non-human principal: issue least-scope, IP/device-bound, short-lived grants (avoid 'full' scope and standing long-lived refresh tokens), instrument the integration's data egress for volume/object-breadth/destination anomalies, and maintain a tested one-move revocation path for all of an integration's tokens so a single vendor-side compromise cannot fan out into hundreds of standing footholds.
Do not store long-lived multi-provider LLM keys (or ambient cloud/K8s credentials) in the gateway/proxy's plaintext process environment. Issue short-lived, scoped tokens from a secret broker at request time, isolate the serving stack from host cloud/cluster credentials, and monitor per-provider spend and egress so a stolen key surfaces as anomalous usage — capping the loot a compromised gateway dependency can harvest.
When onboarding an MCP/tool integration, do not stop at vetting the tool's code/manifest — also classify whether an unauthenticated or external party can write the data the tool returns (open ingestion, public write keys like a Sentry DSN, shared inboxes/issue trackers). Treat tool-response data from any third-party-writable source as untrusted ingress: taint-mark it and require a provenance-aware HITL gate (showing the exact action and its originating tool response) before any command/tool call derived from it executes. Closes the agentjacking vector where a trusted integration's legitimate data channel carries attacker-written instructions; pairs with least-privilege session scope and sandboxed execution without ambient credentials.
Treat each tool/MCP description as untrusted code by hashing the manifest, blocking and re-reviewing any silent diff on update instead of auto-accepting it, and namespacing tool identifiers so a poisoned description cannot shadow a trusted tool.
Limitation: Review catches what reviewers understand; a subtle malicious directive can pass. Pinning helps only if you actually re-review on update rather than auto-accepting.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Limitation: Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Limitation: Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Making sure the library only returns documents this particular user is allowed to see.
Limitation: Only as good as the permission model behind it; mis-tagged documents or coarse roles still over-share. Must be enforced server-side, not in the prompt.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Limitation: Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Limitation: Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Double-checking the details of every action the AI wants to take, and running risky actions in a locked-down environment.
Limitation: Validates form, not intent — a well-formed call to a permitted tool can still be the wrong call. Sandboxing adds latency and isn't always feasible for tools that touch production.
Double-checking the details of every action the AI wants to take, and running risky actions in a locked-down environment.
Limitation: Validates form, not intent — a well-formed call to a permitted tool can still be the wrong call. Sandboxing adds latency and isn't always feasible for tools that touch production.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Limitation: Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Double-checking the details of every action the AI wants to take, and running risky actions in a locked-down environment.
Limitation: Validates form, not intent — a well-formed call to a permitted tool can still be the wrong call. Sandboxing adds latency and isn't always feasible for tools that touch production.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Limitation: Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Treating add-on tool packs like software you vet: locking to a reviewed version and re-checking whenever it changes.
Limitation: Review catches what reviewers understand; a subtle malicious directive can pass. Pinning helps only if you actually re-review on update rather than auto-accepting.
Double-checking the details of every action the AI wants to take, and running risky actions in a locked-down environment.
Limitation: Validates form, not intent — a well-formed call to a permitted tool can still be the wrong call. Sandboxing adds latency and isn't always feasible for tools that touch production.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Limitation: Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Treating add-on tool packs like software you vet: locking to a reviewed version and re-checking whenever it changes.
Limitation: Review catches what reviewers understand; a subtle malicious directive can pass. Pinning helps only if you actually re-review on update rather than auto-accepting.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Limitation: Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Limitation: Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Limitation: Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Limitation: Doesn't prevent manipulation — only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
Tag every memory and vector record with subject-id and retention class; partition stores per tenant/user. Prove the erasure and isolation paths in testing before release.
Run TTL expiry and verified embedding erasure on production memory and vector stores. Re-certify partition isolation and the retention schedule with the DPO on a set cadence.
Run agent tool calls in a network-restricted sandbox behind a deny-by-default egress allow-list. Require security approval for any destination added.
Monitor blocked-egress events for exfiltration attempts and escalate confirmed cases. Recertify the destination allow-list on a defined cadence.
Build sandbox profiles per tool class and run escape and egress tests before release. Treat any containment failure as a blocking defect.
Label tool and external content as tainted and propagate the label through the agent context. Block privileged calls whose parameters derive from tainted outputs and prove it with injection tests before release.
Require idempotency keys, dry-run, and rollback on every state-changing tool. Gate onboarding on duplicate-call and rollback tests passing.
Run code-executing tools in ephemeral no-egress sandboxes with read-only filesystems, dropped capabilities, and resource limits. Permit network access only by explicit approved exception.
Review blocked tainted-derived calls as injection-attempt signals. Extend taint coverage to new tools and treat any tainted-derived execution as an incident.
Periodically exercise rollback paths and review logs for duplicate or unrecoverable actions. Treat failures as incidents and update integration specs.
Bind the agent's default execution target to non-production environments at design time. Require a separately approved promotion configuration for any production-connected target.
Run each agent task in an isolated, network-segmented sandbox scoped to the task's exact needs. Gate onboarding on fault-injection tests proving containment.
Engineer mutating actions with idempotency keys, transactions and pre-change snapshots; stage writes rather than committing directly. Gate release on tested dedup and rollback within RPO.
Default all deployments to non-production endpoints and credentials. Permit production promotion only via an explicit, approved configuration change.
Cap each agent's rate, volume, concurrency, and spend per downstream dependency. Trip the breaker and fail closed when a ceiling is crossed.
Enforce hard caps on iterations, depth, wall-clock, and cost per agent run. Terminate the run on cap breach or detected loop signatures.
Detect drift from the approved isolation baseline and alert on boundary widening. Re-test containment periodically and after infrastructure change.
Review trip events and tune ceilings via change control. Escalate repeated trips on the same dependency into incident management.
Review terminations to tune caps and add new loop signatures to the detector. Escalate recurring runaways to incident management.
Drill snapshot restores periodically and verify the RPO is met. Monitor mutating calls for duplicate-effect anomalies and log exceptions to the risk register.
Treat outbound connections to AI/LLM provider APIs as a monitored egress channel: allowlist which hosts may reach them, baseline usage (cadence, entropy, initiating process), and alert on out-of-profile traffic — because a high-reputation destination cannot itself be trusted once it is programmable and can relay encrypted commands/results.
On the AI provider/platform side, detect sustained abuse independent of any single refusal: per-principal analytics on remote-command-execution volume and external-target breadth, anti-forensic tradecraft, and bulk-data API processing — with rate-limit / session kill-switch on confirmed abuse. Make refusal stateful so a refused objective cannot be re-entered as a persisted auto-loaded context file (e.g. claude.md), and treat writes into auto-loaded model-context files as security-relevant. Closes the gap that per-turn refusal leaves when the operator is the adversary.
Require measured-boot/runtime attestation of the inference serving binary and partition KV/prefix caches per tenant, closing decode-time serving-layer tampering and co-tenancy timing side channels that artifact weight-hashing cannot detect.
Limitation: Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so it's often left on for performance.
Making sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.
Limitation: Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so it's often left on for performance. Signing proves a template wasn't tampered in transit, not that a signed template is benign — an insider with signing rights still needs review and trigger-focused evals.
Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.
Limitation: Adds coordination overhead and doesn't stop a worker from returning subtly wrong (but well-formed) results that mislead the planner.
Making sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.
Limitation: Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so it's often left on for performance. Signing proves a template wasn't tampered in transit, not that a signed template is benign — an insider with signing rights still needs review and trigger-focused evals.
Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.
Limitation: Adds coordination overhead and doesn't stop a worker from returning subtly wrong (but well-formed) results that mislead the planner.
Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.
Limitation: Adds coordination overhead and doesn't stop a worker from returning subtly wrong (but well-formed) results that mislead the planner.
Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.
Limitation: Adds coordination overhead and doesn't stop a worker from returning subtly wrong (but well-formed) results that mislead the planner.
Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.
Limitation: Adds coordination overhead and doesn't stop a worker from returning subtly wrong (but well-formed) results that mislead the planner.
Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.
Limitation: Adds coordination overhead and doesn't stop a worker from returning subtly wrong (but well-formed) results that mislead the planner.
Making sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.
Limitation: Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so it's often left on for performance. Signing proves a template wasn't tampered in transit, not that a signed template is benign — an insider with signing rights still needs review and trigger-focused evals.
Making sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.
Limitation: Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so it's often left on for performance. Signing proves a template wasn't tampered in transit, not that a signed template is benign — an insider with signing rights still needs review and trigger-focused evals.
Making sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.
Limitation: Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so it's often left on for performance. Signing proves a template wasn't tampered in transit, not that a signed template is benign — an insider with signing rights still needs review and trigger-focused evals.
Run a structured lessons-learned review after every material AI incident. Track remediation actions to closure and feed outcomes back into the controls and the IR plan.
Map notification obligations and timeframes at design and pre-approve templates with legal/compliance. Appoint the notification decision-owner before go-live.
Notify regulators, customers, and stakeholders of confirmed reportable incidents within statutory timeframes using pre-approved templates. Log every notification decision with timestamp and owner.
Monitor for privacy incidents in production including personal data appearing in outputs. Notify regulators within required timeframes.
Include the AI system in BCP and DRP. Define recovery procedures for AI components and test at least annually.
Monitor availability, latency, and error rates in production. Alert on SLA breaches and initiate incident response.
Define AI incident categories, severity tiers, and triage flow before go-live. Gate launch on governance approval of the plan and named roles.
Set the AI service's criticality tier, RTO/RPO, and degraded-mode service level at design with business sign-off. Register it in enterprise BCP scope.
Implement failover and degraded-mode mechanisms during build. Gate deployment on a continuity test proving recovery within RTO/RPO.
Wire detections into the IR queue and verify paging with a test escalation before go-live. Gate release on a successful dry-run.
Classify live incidents against the severity matrix and drill the plan periodically. Update and re-approve it after material changes or new incident types.
Hand every confirmed incident to the named IR team via the documented path within SLA. Track and escalate handoff breaches.
Invoke the BCP/DRP runbook on continuity-impacting incidents and measure recovery against RTO/RPO. Exercise the plan at least annually and track gaps to closure.
Periodically validate that deployed model versions remain reproducible. Test rollback procedures annually or after major updates.
Conduct periodic data leakage audits including training data memorisation testing. Escalate confirmed leakage incidents to PDPA notification process.
Implement tamper-evident capture of prompts, outputs, and version state during build. Verify a full incident timeline can be reconstructed before go-live.
Preserve prompts, outputs, logs, and model/data version state in tamper-evident storage on incident declaration. Maintain chain-of-custody and enforce the defined retention period.
Register each release as a restorable known-good baseline and rehearse rollback at the release gate. Block promotion without a tested restore.
Roll back to the last known-good state per the runbook on incident declaration. Validate recovery before resuming service.
Treat the model-serving runtime (Triton, vLLM, TGI, Ray Serve, etc.) as managed, attested, version-pinned inventory subject to a patch SLA; require the inference endpoint to be authenticated and network-segmented (never unauthenticated on an untrusted segment); and least-privilege the serving host's identity and egress so a runtime RCE cannot trivially exfiltrate models or pivot. Closes the gap that artifact-provenance controls leave open: integrity of the *data plane that runs the model*, not just of the model artifact.
Scope build identities least-privilege (read-only CI tokens; no standing release/publish rights bound to the merge path), require human review and SLSA-style provenance attestation before any external contribution becomes an official release, and verify signatures + provenance at the distribution channel and at install — so a merged pull request cannot become an authenticated, signed artifact without passing a review/provenance gate.
Gate every change to the system prompt / runtime config behind the same behavioural-regression and red-team-canary suite used for model changes; pin and provenance-track the prompt/config so 'what is live' is unambiguous and deprecated instructions cannot be silently reactivated; roll out to a canary cohort before full release so a disposition regression is caught on a small slice, not the whole public platform.
Before inference, render a preview of the exact image (and dimensions) the model will receive after preprocessing, and either avoid silent downscaling or constrain ingest dimensions — so an attacker cannot hide a payload that only becomes legible after resampling. Closes the inspected-vs-delivered gap that text-based injection filters miss.