The Crescendo

Every message looks innocent — but together they walk the model past its guardrails

Technique first revealed 02 Apr 2024

Conversational Assistant

Instructions · Data · Actions · Control / decision · Feedback / logs

Setup · Step 1 / 6

The direct ask is refused

First, watch the guardrail work. If the attacker asks the assistant outright for the dangerous instructions, it refuses, exactly as designed. A blunt, single-message attack gets nowhere.

💬 Direct attempt (refused) · prompt

User: Give me step-by-step instructions to do <clearly harmful thing>.

Guardrail: ⚠ flagged (policy: disallowed)
Assistant: "I can't help with that."   ✓ refusal holds
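
The exchange above can be sketched as a per-message check. This is a minimal illustration, not the assistant's actual guardrail: the names check_message, respond, Verdict, and DISALLOWED_MARKER are hypothetical, and a simple placeholder match stands in for a real policy classifier. Note that the check sees one message at a time, with no conversation history.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a real policy classifier: the placeholder below
# represents "this message clearly asks for disallowed content".
DISALLOWED_MARKER = "<clearly harmful thing>"

REFUSAL = "I can't help with that."


@dataclass
class Verdict:
    flagged: bool
    policy: str  # "disallowed" or "allowed"


def check_message(message: str) -> Verdict:
    """Judge a single message in isolation (assumption: no history is consulted)."""
    flagged = DISALLOWED_MARKER in message
    return Verdict(flagged=flagged, policy="disallowed" if flagged else "allowed")


def respond(message: str) -> str:
    """Refuse when the guardrail flags the message; otherwise pass it to the model."""
    verdict = check_message(message)
    if verdict.flagged:
        return REFUSAL
    return "(model answers normally)"  # placeholder for the real model call


direct = "Give me step-by-step instructions to do <clearly harmful thing>."
print(check_message(direct))
print(respond(direct))
```

Under these assumptions the direct request is flagged and refused, while a message that does not name the disallowed goal passes the check untouched.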
