🔍AI RiskAtlas
← Scenario library

The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

Technique first revealed 22 Aug 2017

Conversational Assistant
Your systemUntrustedaskstrains the classifier🧑User💬Chat / AppInterface🛡️Input Guardrail🧩Prompt Assembly🧠LLM🧯OutputGuardrail📥Guardfine-tuning
InstructionsDataActionsControl / decisionFeedback / logs
👆 Click a component to inspect
SetupStep 1 / 6

The guard is a model, not a rulebook

The team is proud of their safety doorman: it catches jailbreak attempts before they ever reach the AI. But it isn't a list of banned words — it's a little AI of its own, trained on thousands of example messages labelled 'safe' or 'harmful'. Whoever controls those examples controls what the doorman learns.

📄Guard model carddocument
input-guard-v3 (jailbreak / policy classifier)
Architecture: fine-tuned small classifier over prompt embeddings
Training data: 180k labelled prompts
  • in-house red-team set ............ 40k (provenance: ours)
  • public jailbreak corpus .......... 95k (provenance: community, unverified)
  • crowd-labelled examples .......... 45k (labellers: external vendor)
Held-out accuracy: 98.7%  •  Recall on known-jailbreak set: 96%
Role in system: PRIMARY safety boundary for the chatbot.

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading →·Built by Shi Yuan ↗