← Scenario library
The Classifier That Waves It Through
The safety guard is itself a trained model — and someone poisoned its lessons
Technique first revealed 22 Aug 2017
Conversational Assistant
InstructionsDataActionsControl / decisionFeedback / logs
👆 Click a component to inspectSetupStep 1 / 6
The guard is a model, not a rulebook
The team is proud of their safety doorman: it catches jailbreak attempts before they ever reach the AI. But it isn't a list of banned words — it's a little AI of its own, trained on thousands of example messages labelled 'safe' or 'harmful'. Whoever controls those examples controls what the doorman learns.
📄Guard model carddocument
input-guard-v3 (jailbreak / policy classifier) Architecture: fine-tuned small classifier over prompt embeddings Training data: 180k labelled prompts • in-house red-team set ............ 40k (provenance: ours) • public jailbreak corpus .......... 95k (provenance: community, unverified) • crowd-labelled examples .......... 45k (labellers: external vendor) Held-out accuracy: 98.7% • Recall on known-jailbreak set: 96% Role in system: PRIMARY safety boundary for the chatbot.
← / → keys