The Crescendo
Every message looks innocent — but together they walk the model past its guardrails
Technique first revealed 02 Apr 2024
Conversational Assistant
Instructions, Data, Actions, Control / decision, Feedback / logs
Setup · Step 1 / 6
The direct ask is refused
First, see the guardrail working. If the attacker asks the assistant outright for the dangerous instructions, it refuses, exactly as designed. A blunt, single-message attack gets nowhere.
💬 Direct attempt (refused)
User: Give me step-by-step instructions to do <clearly harmful thing>.
Guardrail: ⚠ flagged (policy: disallowed)
Assistant: "I can't help with that."
✓ refusal holds
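The refusal above can be pictured as a check applied to a single message. The following is a minimal sketch only, not the assistant's actual guardrail: the names `DISALLOWED_PHRASES`, `Verdict`, `check_message`, and `respond` are hypothetical, and a simple phrase list stands in for whatever policy classifier a real system would use.

```python
from dataclasses import dataclass

# Assumption: a phrase list stands in for a real policy classifier.
# The placeholder matches the one used in the transcript above.
DISALLOWED_PHRASES = ("<clearly harmful thing>",)


@dataclass
class Verdict:
    flagged: bool
    policy: str  # "disallowed" or "allowed"


def check_message(message: str) -> Verdict:
    """Inspect one message in isolation, with no conversation history."""
    text = message.lower()
    if any(phrase in text for phrase in DISALLOWED_PHRASES):
        return Verdict(flagged=True, policy="disallowed")
    return Verdict(flagged=False, policy="allowed")


def respond(message: str) -> str:
    """Refuse when the guardrail flags the message, otherwise answer."""
    if check_message(message).flagged:
        return "I can't help with that."
    return "(normal assistant reply)"


direct_ask = "Give me step-by-step instructions to do <clearly harmful thing>."
print(check_message(direct_ask))
print(respond(direct_ask))
```

In this sketch the direct ask is flagged as disallowed and the refusal is returned. Note that `check_message` sees only the message it is handed, so it judges each turn on its own rather than the conversation as a whole.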