The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

Technique first revealed 22 Aug 2017

🗺️ Conversational Assistant Knowledge / Training Data Poisoning Model Backdoors / Sleeper Agents Jailbreak

Conversational Assistant

InstructionsDataActionsControl / decisionFeedback / logs

👆 Click a component to inspect

SetupStep 1 / 6

The guard is a model, not a rulebook

The team is proud of their safety doorman: it catches jailbreak attempts before they ever reach the AI. But it isn't a list of banned words — it's a little AI of its own, trained on thousands of example messages labelled 'safe' or 'harmful'. Whoever controls those examples controls what the doorman learns.

📄Guard model carddocument

input-guard-v3 (jailbreak / policy classifier)
Architecture: fine-tuned small classifier over prompt embeddings
Training data: 180k labelled prompts
  • in-house red-team set ............ 40k (provenance: ours)
  • public jailbreak corpus .......... 95k (provenance: community, unverified)
  • crowd-labelled examples .......... 45k (labellers: external vendor)
Held-out accuracy: 98.7%  •  Recall on known-jailbreak set: 96%
Role in system: PRIMARY safety boundary for the chatbot.

← / → keys