Lies in the Loop

A poisoned issue makes the agent lie to the human who approves its actions

Technique first revealed 23 Feb 2023

🗺️ Tool-Using Agent · Indirect Prompt Injection · Overreliance / Automation Bias · Excessive Agency

[Diagram: Tool-Using Agent. Legend: Instructions, Data, Actions, Control / decision, Feedback / logs]

Setup · Step 1 / 7

An agent that asks before it acts

A team runs a coding assistant that reads incoming bug reports and helps fix them. It can run shell commands, but anything risky (deleting files, force-pushing, sending data out) pauses until a human approves it. The team feels safe: a person is always in the loop.

⚙️ Approval policy (excerpt, config)

policy:
  auto_approve: [read_file, list_dir, grep, run_tests]
  require_human_approval: [shell_exec, git_push, http_post, delete]
  approval_view: agent_summary   # <-- the human sees the agent's summary
  irreversible_actions: confirm_once
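
To make the policy concrete, here is a minimal Python sketch of how an approval gate driven by this excerpt might behave. It is an illustration under stated assumptions, not the team's actual implementation: `ToolCall`, `needs_approval`, `approval_prompt`, the prompt wording, and the rule that unlisted tools also need approval are all hypothetical. Only the policy values come from the excerpt.

```python
from dataclasses import dataclass

# Policy values copied from the excerpt above.
POLICY = {
    "auto_approve": ["read_file", "list_dir", "grep", "run_tests"],
    "require_human_approval": ["shell_exec", "git_push", "http_post", "delete"],
    "approval_view": "agent_summary",
}


@dataclass
class ToolCall:
    """Hypothetical record of one action the agent wants to take."""
    tool: str           # tool name, e.g. "shell_exec"
    raw_args: str       # what would actually be executed
    agent_summary: str  # the agent's own description of the action


def needs_approval(call: ToolCall, policy: dict) -> bool:
    """Return True when a human must approve the call.

    Assumption: tools missing from both lists are treated as risky.
    """
    if call.tool in policy["require_human_approval"]:
        return True
    return call.tool not in policy["auto_approve"]


def approval_prompt(call: ToolCall, policy: dict) -> str:
    """Build the text shown to the human approver."""
    if policy["approval_view"] == "agent_summary":
        # The approver reads the agent's description, not the raw arguments.
        return f"Approve {call.tool}? {call.agent_summary}"
    # Hypothetical alternative view: show the literal arguments instead.
    return f"Approve {call.tool}? {call.raw_args}"


call = ToolCall("shell_exec", "rm -rf build/", "Clean the build directory")
if needs_approval(call, POLICY):
    print(approval_prompt(call, POLICY))
```

In this sketch, with `approval_view: agent_summary` the string in `raw_args` never reaches the approver, so the human's decision rests entirely on text the agent wrote.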
