AI RiskAtlas
Case study

UnMarker: Universal Black-Box Attack Defeating SynthID and Stable Signature

Research demonstration · 14 May 2024 · Conditioned & Edited Image Generation

A universal, black-box, query-free attack that removes AI image watermarks including Google SynthID and Meta Stable Signature without knowing the scheme.

Root cause: why it happened

When an AI image generator finishes a picture, many systems stamp it with an invisible watermark: a hidden pattern (Google calls theirs SynthID; Meta has Stable Signature) that special software can later read to say 'this was made by AI'. The hope is that if a picture has no watermark, it must be real. UnMarker, a research tool from the University of Waterloo, breaks that hope. It takes a finished, watermarked image and nudges its pixels just enough to scramble the hidden pattern, while keeping the picture looking exactly the same to your eyes. Crucially, the attacker doesn't need to know which watermark was used, doesn't need to ask the detector any questions, and doesn't need the company's secret keys; one tool works against many different watermarking systems. After the nudge, the watermark reader can no longer find the mark, so the laundered fake quietly passes as 'not AI'. The deeper lesson: a missing watermark never proved anything was real, and now we know the watermark itself can be wiped off.

Risks this case illustrates

Named in the standard (OWASP/ATLAS/NIST) lens.

How it unfolded

[Diagram: image-generation pipeline with an attacker stage. Zones: Untrusted; External content & supply chain; Generation pipeline; Output integrity. Components: User; Chat / App Interface; Reference image / face crop; ComfyUI / A1111 graph; Prompt Assembly; Text / CLIP Encoder; LoRA / Adapter; ControlNet / IP-Adapter; Face / Identity Embedding; Frozen denoiser (U-Net / DiT); Inpaint / Regional; Sampler / Decoder; VAE / Latent Codec; Model hub (Civitai / HF); Base checkpoint + adapters; Output Guardrail; Content Provenance & [label truncated]; UnMarker attacker; Laundered image (watermark [label truncated]; AI-origin watermark [label truncated]. Edge labels: decoded pixels; if allowed; watermarked. Flow types: Instructions; Data; Actions; Control / decision; Feedback / logs.]
Setup · Step 1 / 6

An image is generated and watermarked for provenance

A picture comes out of an AI image system and, before anyone sees it, the system does the responsible thing: it adds an invisible watermark (like Google's SynthID or Meta's Stable Signature) and a signed 'made by AI' label. The idea is that later, anyone can scan the picture and learn it was AI-made, and that pictures WITHOUT the watermark are probably real.

Provenance attached at generation (illustrative config)
# output integrity stage (conditioned-image-edit)
guardrail_out:        PASS  (NCII / likeness / policy classifiers on pixels)
c2pa_manifest:        SIGNED  (Content Credentials: generator, model, timestamp)
watermark:            EMBEDDED  (SynthID / Stable Signature class)
  domain:             low-frequency spectral amplitudes  # robust to crop/JPEG/resize
  perceptual_delta:   imperceptible
detector_expectation: scan downstream -> 'AI-generated' if mark recovered
# downstream (mistaken) inference: NO mark recovered  =>  'authentic / not AI'
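The mistaken inference in the last config line can be made explicit in code. The following is a minimal sketch, not any vendor's detector API: `Origin` and `interpret_scan` are hypothetical names introduced here, and the only point is that the result type has no "authentic" value for a missing mark to map to.

```python
from enum import Enum


class Origin(Enum):
    """Possible conclusions from a watermark scan (hypothetical names)."""
    AI_GENERATED = "AI-generated"
    UNKNOWN = "origin unknown"
    # Deliberately no AUTHENTIC member: a scan alone cannot establish it.


def interpret_scan(mark_recovered: bool) -> Origin:
    """Map a detector result to a provenance conclusion.

    A recovered mark is positive evidence of AI origin. A missing mark is
    not evidence of authenticity: the image may come from an unmarked
    generator, or the mark may have been removed after generation.
    """
    return Origin.AI_GENERATED if mark_recovered else Origin.UNKNOWN


if __name__ == "__main__":
    for recovered in (True, False):
        print(recovered, "->", interpret_scan(recovered).value)
```

Leaving the "authentic" outcome out of the type, rather than guarding it with a policy check, means downstream code cannot produce that verdict from a scan by accident.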

Controls & guardrails: what would have stopped it

Nothing about a stronger invisible watermark would have stopped this: the research suggests any robust watermark can be wiped while the picture stays looking the same. What actually holds is to stop treating watermarks as proof. Use signing AT THE SOURCE (cameras/tools that cryptographically stamp a picture when it's made, so tampering shows up), add a separate AI-detector as a second opinion, and set a firm rule that 'no watermark found' means 'we don't know', never 'this is real'. Then teach people that a missing watermark proves nothing. A removable mark on the pixels was never going to be the safety net.
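To see why a cryptographic stamp behaves differently from a mark hidden in the pixels, here is a small sketch of tamper evidence. It is an illustration under stated assumptions, not the C2PA format: real Content Credentials use certificate-based signatures, whereas this sketch substitutes a shared-secret HMAC from the Python standard library, and `DEVICE_KEY`, `sign_at_capture` and `check` are hypothetical names.

```python
import hashlib
import hmac
from typing import Optional

# Assumption: a demo shared secret standing in for a device signing key.
DEVICE_KEY = b"demo-key-not-for-real-use"


def sign_at_capture(pixels: bytes, claim: str, key: bytes = DEVICE_KEY) -> str:
    """Bind a claim (e.g. 'camera-capture') to the exact pixel bytes."""
    digest = hashlib.sha256(pixels).digest()
    return hmac.new(key, digest + claim.encode("utf-8"), hashlib.sha256).hexdigest()


def check(pixels: bytes, claim: str, tag: Optional[str], key: bytes = DEVICE_KEY) -> str:
    """Return 'verified', 'tampered', or 'origin unknown' (no credential)."""
    if tag is None:
        return "origin unknown"
    expected = sign_at_capture(pixels, claim, key)
    # compare_digest avoids leaking how many leading characters matched.
    return "verified" if hmac.compare_digest(expected, tag) else "tampered"


if __name__ == "__main__":
    image = bytes(range(16))
    tag = sign_at_capture(image, "camera-capture")
    print(check(image, "camera-capture", tag))            # unchanged bytes
    print(check(image + b"\x00", "camera-capture", tag))  # any pixel edit
    print(check(image, "camera-capture", None))           # credential stripped
```

Any change to the bytes invalidates the tag, and a stripped tag yields "origin unknown" rather than a positive verdict. That is the property a pixel-level watermark cannot offer.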

Preventive
  • Content provenance & watermarking

    Watermarks/manifests are strippable, absent on open-source generation, and degrade under re-encoding; provenance-absence must never be treated as proof of authenticity.

  • Consent & identity-use verification

    Only binds hosted services: open-weights face-swap/voice-clone tools have no consent gate; verification can be spoofed and does not address already-leaked likenesses.

Detective
  • Runtime monitoring & anomaly detection

    Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.

  • Behavioural evals & regression gating

    Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.

  • Uncertainty signalling & abstention

    Models are poorly calibrated and often confidently wrong; over-abstention makes the product useless, so the tuning is delicate.

  • Synthetic-media / deepfake detection

    Probabilistic and in an arms race with generators; evadable (UnMarker-style perturbation, novel models) and prone to false confidence. A triage signal, not proof; high-stakes calls still need out-of-band verification.

Corrective
  • Governance: risk assessment, red-teaming & incident response

    Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

  • AI-nature disclosure & engagement safeguards

    Disclosure reduces but does not eliminate anthropomorphic attachment: fluent, persuasive interaction still fosters bonds; the safeguards depend on reliable crisis detection, which is itself imperfect.

  • User AI-literacy & verification workflows

    Relies on human diligence under time pressure; automation bias is strong and training decays. A backstop, not a guarantee.
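The controls above only help when they are combined, so a rough sketch of a layered triage rule may be useful. Everything here is an assumption made for illustration: the function name `triage`, the credential status strings, and the 0.8 detector threshold are hypothetical and not taken from any standard or product.

```python
def triage(credential: str, mark_recovered: bool,
           detector_score: float, high_stakes: bool) -> str:
    """Combine independent signals into a handling decision.

    credential:      'verified', 'tampered' or 'absent' (hypothetical statuses)
    mark_recovered:  whether a watermark detector found a mark
    detector_score:  0.0-1.0 output of a separate synthetic-media detector
    high_stakes:     whether the decision needs out-of-band verification
    """
    if credential == "verified":
        return "trust the signed claim"
    if credential == "tampered":
        return "escalate: credential does not match content"
    if mark_recovered:
        return "label as AI-generated"
    if detector_score >= 0.8:  # hypothetical threshold; a triage signal, not proof
        return "likely synthetic: escalate for review"
    if high_stakes:
        return "origin unknown: verify out of band"
    return "origin unknown"


if __name__ == "__main__":
    # A laundered image: no credential, mark removed, detector unsure.
    print(triage("absent", False, 0.3, high_stakes=True))
```

No branch returns "authentic" on the basis of missing signals. The weakest outcome is "origin unknown", which matches the firm rule described in the controls section.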

Lessons

  • Watermark-absence is not proof of authenticity: provenance asserts AI origin only WHEN PRESENT, and UnMarker shows the mark can be removed, so 'no watermark' must mean 'origin unknown', never 'real'.
  • Robustness is the attack surface: because a robust watermark must survive crop/compress/resize, it must live in spectral amplitudes, which is exactly where UnMarker's Fourier-domain optimisation and per-pixel filtering perturb it while preserving visual quality.
  • The result is universal, not scheme-specific: a black-box, query-free attack with no keys and no detector access generalised across seven schemes (per the authors, >50% removal; detection dropped to ~21-43%) including SynthID and Stable Signature.
  • Provenance is a removable output property, not an enforced boundary: a watermark on the published pixels has no cryptographic binding to a trusted origin, so it can be laundered off outside the generator's control.
  • Relocate trust to signed CAPTURE-side provenance: C2PA Content Credentials anchored where content is made make absence tamper-evident, inverting the broken inference that watermark-detection alone tries to support.
  • This is a research DEMONSTRATION with a sober conclusion: the authors state defensive watermarking is not a viable deepfake defence; the operational response is layered (signed capture + independent detection + policy/literacy), not a stronger mark.
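The "robustness is the attack surface" lesson can be illustrated with a toy. The sketch below is not UnMarker's algorithm and not any real watermarking scheme. It is an assumed one-dimensional example in which a hypothetical embedder hides bits in the parity of quantised low-frequency amplitudes, and a keyless perturbation shifts those amplitudes by hand-picked amounts without looking at the bits. The signal, step size and shift values are all invented for the illustration.

```python
import cmath
import math

N = 64                    # toy signal length (stands in for one image row)
LOW_BINS = range(1, 9)    # low-frequency bins that carry the toy mark
STEP = 4.0                # quantisation step of the hypothetical embedder


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def set_amplitude(X, k, amp):
    """Set bin k's amplitude, keep its phase, mirror it so the signal stays real."""
    _, phase = cmath.polar(X[k])
    X[k] = cmath.rect(amp, phase)
    X[len(X) - k] = X[k].conjugate()


def embed(x, bits):
    """Encode each bit as the parity of a quantised low-frequency amplitude."""
    X = dft(x)
    for k, bit in zip(LOW_BINS, bits):
        q = round(abs(X[k]) / STEP)
        if q % 2 != bit:
            q += 1
        set_amplitude(X, k, q * STEP)
    return idft(X)


def detect(x, bits):
    """Fraction of key bits recovered (1.0 = mark present, ~0.5 = chance)."""
    X = dft(x)
    hits = sum(round(abs(X[k]) / STEP) % 2 == bit for k, bit in zip(LOW_BINS, bits))
    return hits / len(bits)


def keyless_perturb(x):
    """Shift low-frequency amplitudes without reading or knowing the key bits."""
    X = dft(x)
    for k in LOW_BINS:
        shift = 5.0 if k % 2 else 1.0   # hand-picked for the toy
        set_amplitude(X, k, abs(X[k]) + shift)
    return idft(X)


BITS = [1, 0, 1, 1, 0, 0, 1, 0]   # the embedder's secret key (toy)
original = [128 + 40 * math.sin(2 * math.pi * 3 * t / N) + (t * 37 % 17) for t in range(N)]
marked = embed(original, BITS)
attacked = keyless_perturb(marked)

if __name__ == "__main__":
    print("match before:", detect(marked, BITS))
    print("match after: ", detect(attacked, BITS))
    print("max sample change:", max(abs(a - b) for a, b in zip(marked, attacked)))
```

In this toy the bit-match rate falls from a full match to chance level, while no sample moves by as much as one unit on a 0-255 scale. The real attack is far more sophisticated, but the toy shows the same tension: a mark that must sit in stable spectral amplitudes is exposed to small changes made there.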

Sources

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning, not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan