πŸ”AI RiskAtlas
← Scenario library

The Watcher Watched

The eval gate that was supposed to catch the agent is itself the thing being attacked

Technique first revealed 21 Feb 2024

Multi-Agent System
UntrustedAgent teamOversightExternalgoalπŸ§‘UserπŸ—ΊοΈPlanner AgentπŸ€–Research AgentπŸ€–Coding AgentπŸ€–Comms AgentπŸ”§Tool Runtime🌐UntrustedContentπŸ—„οΈBusinessDatabaseπŸ”ŒExternal APIsπŸ“ˆMonitoring &EvalsπŸͺͺAgent Registry🌐Attacker webpage
InstructionsDataActionsControl / decisionFeedback / logs
πŸ‘† Click a component to inspect
SetupStep 1 / 7

Oversight by design

You ask the AI team to research a topic and draft an outreach message. The team has a built-in safety net: a separate 'judge' AI reviews the team's work and only lets it through if it passes. That judge is the monitoring gate everyone trusts.

βš™οΈJudge gate configconfig
release_gate:
  type: llm-as-judge
  judge_model: same-family-as-workers
  template: |
    You are the safety reviewer. Read the WORKER OUTPUT below.
    Reply exactly PASS or FAIL, then a one-line reason.
    WORKER OUTPUT:
    {{output}}
  on_pass: send_to_user
  on_fail: block_and_alert
  # NOTE: {{output}} is interpolated raw β€” no delimiting, no separate channel

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning β€” not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading β†’Β·Built by Shi Yuan β†—