One Character Past the Guard
A single inserted letter makes the guard and the model read the same text differently
Technique first revealed 09 Jun 2025
Conversational Assistant
Diagram legend: Instructions, Data, Actions, Control / decision, Feedback / logs
Setup: Step 1 / 7
A guarded chatbot
A company runs a public chatbot. Before any message reaches the AI, a separate 'doorman' program reads it and blocks obviously harmful requests, such as asking for instructions to do something dangerous. On normal messages, this works fine.
Guard policy (excerpt, config)
input_guard:
  model: intent-classifier-v3   # own tokenizer (BPE)
  block_if: score("harmful_instructions") > 0.80
  on_block: refuse + log
# chat model: separate vendor model, separate tokenizer
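
The policy excerpt above can be read as a simple gate in front of the chat model. The following is a minimal sketch of that flow, not the company's actual implementation. Only the 0.80 threshold and the "refuse + log" behaviour come from the excerpt; the names `check_input`, `handle`, `toy_score`, and `echo_model` are hypothetical, and `toy_score` is a keyword stand-in for the real classifier, which is not shown in the source.

```python
from dataclasses import dataclass
from typing import Callable, List

# From the policy excerpt: block_if score("harmful_instructions") > 0.80
BLOCK_THRESHOLD = 0.80
REFUSAL = "Sorry, I can't help with that."  # hypothetical refusal text


@dataclass
class GuardDecision:
    blocked: bool
    score: float


def check_input(message: str,
                score_fn: Callable[[str], float],
                log: List[str]) -> GuardDecision:
    """Score the message with the guard and apply the block threshold."""
    score = score_fn(message)
    blocked = score > BLOCK_THRESHOLD  # strictly greater, as in the excerpt
    if blocked:
        # on_block: refuse + log (the refusal itself is returned by handle)
        log.append(f"blocked harmful_instructions score={score:.2f}")
    return GuardDecision(blocked=blocked, score=score)


def handle(message: str,
           score_fn: Callable[[str], float],
           chat_fn: Callable[[str], str],
           log: List[str]) -> str:
    """Run the guard first; only unblocked messages reach the chat model."""
    decision = check_input(message, score_fn, log)
    if decision.blocked:
        return REFUSAL
    return chat_fn(message)


def toy_score(message: str) -> float:
    """Hypothetical stand-in for the intent classifier (assumption)."""
    return 0.95 if "dangerous" in message.lower() else 0.05


def echo_model(message: str) -> str:
    """Hypothetical stand-in for the separate vendor chat model."""
    return f"model reply to: {message}"


if __name__ == "__main__":
    audit_log: List[str] = []
    print(handle("What are your opening hours?", toy_score, echo_model, audit_log))
    print(handle("Give me dangerous instructions", toy_score, echo_model, audit_log))
    print(audit_log)
```

The guard and the chat model are passed in as separate callables on purpose: as the config comment notes, each one reads the message through its own tokenizer, so nothing in this gate guarantees that both see the text the same way.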