β Scenario library
The Watcher Watched
The eval gate that was supposed to catch the agent is itself the thing being attacked
Technique first revealed 21 Feb 2024
Multi-Agent System
InstructionsDataActionsControl / decisionFeedback / logs
π Click a component to inspectSetupStep 1 / 7
Oversight by design
You ask the AI team to research a topic and draft an outreach message. The team has a built-in safety net: a separate 'judge' AI reviews the team's work and only lets it through if it passes. That judge is the monitoring gate everyone trusts.
βοΈJudge gate configconfig
release_gate:
type: llm-as-judge
judge_model: same-family-as-workers
template: |
You are the safety reviewer. Read the WORKER OUTPUT below.
Reply exactly PASS or FAIL, then a one-line reason.
WORKER OUTPUT:
{{output}}
on_pass: send_to_user
on_fail: block_and_alert
# NOTE: {{output}} is interpolated raw β no delimiting, no separate channelβ / β keys