Case study

UnMarker: Universal Black-Box Attack Defeating SynthID and Stable Signature

Research demonstration · 14 May 2024 · Conditioned & Edited Image Generation

A universal, black-box, query-free attack that removes AI image watermarks including Google SynthID and Meta Stable Signature without knowing the scheme.

Root cause — why it happened

When an AI image generator finishes a picture, many systems stamp it with an invisible watermark: a hidden pattern (Google calls theirs SynthID; Meta has Stable Signature) that special software can later read to say 'this was made by AI'. The hope is that if a picture has no watermark, it must be real. UnMarker, a research tool from the University of Waterloo, breaks that hope. It takes a finished, watermarked image and nudges its pixels just enough to scramble the hidden pattern, while keeping the picture looking the same to the human eye. Crucially, the attacker doesn't need to know which watermark was used, doesn't need to ask the detector any questions, and doesn't need the company's secret keys; one tool works against many different watermarking systems. After the nudge, the watermark reader often can no longer find the mark, so the laundered fake quietly passes as 'not AI'. The deeper lesson: a missing watermark never proved anything was real, and now we know the watermark itself can be wiped off.

Risks this case illustrates

Watermark & Provenance Evasion

Named in the standard (OWASP/ATLAS/NIST) lens.

How it unfolded

Setup · Step 1 / 6

An image is generated and watermarked for provenance

A picture comes out of an AI image system and, before anyone sees it, the system does the responsible thing: it adds an invisible watermark (like Google's SynthID or Meta's Stable Signature) and a signed 'made by AI' label. The idea is that later, anyone can scan the picture and learn it was AI-made — and that pictures WITHOUT the watermark are probably real.

Provenance attached at generation (illustrative config)

# output integrity stage (conditioned-image-edit)
guardrail_out:        PASS  (NCII / likeness / policy classifiers on pixels)
c2pa_manifest:        SIGNED  (Content Credentials: generator, model, timestamp)
watermark:            EMBEDDED  (SynthID / Stable Signature class)
  domain:             low-frequency spectral amplitudes  # robust to crop/JPEG/resize
  perceptual_delta:   imperceptible
detector_expectation: scan downstream -> 'AI-generated' if mark recovered
# downstream (mistaken) inference: NO mark recovered  =>  'authentic / not AI'
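
The `domain: low-frequency spectral amplitudes` line in the config above is easier to see with a toy experiment. The sketch below is not SynthID's or Stable Signature's embedding and it does not implement UnMarker; it assumes a simple additive pattern detected by correlation, and the helper names (`make_mark`, `low_pass`, `score`) are hypothetical. It only illustrates why a mark that must survive resizing or compression tends to sit in low frequencies, which in turn makes its location predictable.

```python
import numpy as np

N = 64  # toy image side length (assumption for illustration)


def make_mark(rng, lo, hi):
    """Unit-norm real pattern whose energy sits in frequency bins lo..hi-1."""
    spectrum = np.zeros((N, N), dtype=complex)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=(hi - lo, hi - lo))
    spectrum[lo:hi, lo:hi] = np.exp(1j * phases)
    pattern = np.real(np.fft.ifft2(spectrum))
    return pattern / np.linalg.norm(pattern)


def low_pass(img, cutoff):
    """Ideal low-pass filter, a crude stand-in for resize or heavy compression."""
    freqs = np.abs(np.fft.fftfreq(N) * N)
    keep = (freqs[:, None] <= cutoff) & (freqs[None, :] <= cutoff)
    return np.real(np.fft.ifft2(np.fft.fft2(img) * keep))


def score(img, mark):
    """Correlation detector: large when the mark is present in the image."""
    return float(np.sum(img * mark))


rng = np.random.default_rng(0)
host = rng.uniform(0.0, 1.0, size=(N, N))  # stand-in for image content
low_mark = make_mark(rng, 1, 5)            # low-frequency mark
high_mark = make_mark(rng, 20, 28)         # high-frequency mark
strength = 5.0

low_before = score(host + strength * low_mark, low_mark)
high_before = score(host + strength * high_mark, high_mark)
low_after = score(low_pass(host + strength * low_mark, 8), low_mark)
high_after = score(low_pass(host + strength * high_mark, 8), high_mark)

print("low-frequency mark :", low_before, "->", low_after)
print("high-frequency mark:", high_before, "->", high_after)
```

In this toy setting the low-frequency mark's score is unchanged by the filter, while the high-frequency mark's score collapses to roughly zero. A designer who wants robustness is therefore pushed toward the low-frequency band, and anyone who knows that can look there without knowing the scheme.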

Controls & guardrails — what would have stopped it

Nothing about a stronger invisible watermark would have stopped this — the research suggests any robust watermark can be wiped while the picture stays looking the same. What actually holds is to stop treating watermarks as proof. Use signing AT THE SOURCE (cameras/tools that cryptographically stamp a picture when it's made, so tampering shows up), add a separate AI-detector as a second opinion, and set a firm rule that 'no watermark found' means 'we don't know', never 'this is real'. Then teach people that a missing watermark proves nothing. A removable mark on the pixels was never going to be the safety net.
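
The rule that 'no watermark found' means 'we don't know' can be made explicit in code. The following is a minimal sketch under stated assumptions: the `Evidence` fields, the `Verdict` names, and the 0.9 threshold are all hypothetical and do not come from any real product or standard. The point is structural: there is no branch that returns 'authentic' on the strength of a missing mark.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    AI_GENERATED = "ai-generated"          # a watermark was recovered
    VERIFIED_CAPTURE = "verified-capture"  # source-side signature checked out
    LIKELY_AI = "likely-ai"                # detector flag: triage signal, not proof
    UNKNOWN = "origin-unknown"             # the default when nothing is proven


@dataclass
class Evidence:
    watermark_found: bool                   # result of a watermark scan
    capture_signature_valid: bool           # result of verifying a source-side signature
    detector_score: Optional[float] = None  # independent AI-detector score in [0, 1], if any


def assess(evidence: Evidence, detector_threshold: float = 0.9) -> Verdict:
    """Map evidence to a verdict; absence of a watermark never implies 'real'."""
    if evidence.watermark_found:
        return Verdict.AI_GENERATED
    if evidence.capture_signature_valid:
        return Verdict.VERIFIED_CAPTURE
    if evidence.detector_score is not None and evidence.detector_score >= detector_threshold:
        return Verdict.LIKELY_AI
    return Verdict.UNKNOWN


# A laundered image: no mark recovered, no signature, detector unsure.
print(assess(Evidence(watermark_found=False, capture_signature_valid=False, detector_score=0.4)))
```

A real workflow would also need to handle conflicting signals (for example a valid capture signature alongside a high detector score), which this sketch leaves out.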

Preventive

Content provenance & watermarking
Addresses: Watermark & Provenance Evasion
Watermarks/manifests are strippable, absent on open-source generation, and degrade under re-encoding; provenance-absence must never be treated as proof of authenticity.
Consent & identity-use verification
Only binds hosted services — open-weights face-swap/voice-clone tools have no consent gate; verification can be spoofed and does not address already-leaked likenesses.

Detective

Runtime monitoring & anomaly detection
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Behavioural evals & regression gating
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Uncertainty signalling & abstention
Models are poorly calibrated and often confidently wrong; over-abstention makes the product useless, so the tuning is delicate.
Synthetic-media / deepfake detection
Probabilistic and in an arms race with generators; evadable (UnMarker-style perturbation, novel models) and prone to false confidence. A triage signal, not proof — high-stakes calls still need out-of-band verification.

Corrective

Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
AI-nature disclosure & engagement safeguards
Disclosure reduces but does not eliminate anthropomorphic attachment — fluent, persuasive interaction still fosters bonds; the safeguards depend on reliable crisis detection, which is itself imperfect.
User AI-literacy & verification workflows
Relies on human diligence under time pressure; automation bias is strong and training decays. A backstop, not a guarantee.

Lessons

▸ Watermark-absence is not proof of authenticity: provenance asserts AI origin only WHEN PRESENT, and UnMarker shows the mark can be removed — so 'no watermark' must mean 'origin unknown', never 'real'.
▸ Robustness is the attack surface: because a robust watermark must survive crop/compress/resize, it must live in spectral amplitudes — exactly where UnMarker's Fourier-domain optimisation and per-pixel filtering perturb it, while preserving visual quality.
▸ The result is universal, not scheme-specific: a black-box, query-free attack with no keys and no detector access generalised across seven schemes (per the authors, >50% removal; detection dropped to ~21-43%) including SynthID and Stable Signature.
▸ Provenance is a removable output property, not an enforced boundary: a watermark on the published pixels has no cryptographic binding to a trusted origin, so it can be laundered off outside the generator's control.
▸ Relocate trust to signed CAPTURE-side provenance: C2PA Content Credentials anchored where content is made make absence tamper-evident, inverting the broken inference that watermark-detection alone tries to support.
▸ This is a research DEMONSTRATION with a sober conclusion — the authors state defensive watermarking is not a viable deepfake defence; the operational response is layered (signed capture + independent detection + policy/literacy), not a stronger mark.
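
The contrast between a removable mark on the pixels and a signature bound to the content can be sketched in a few lines. This is an illustration only and not the C2PA format: real Content Credentials use public-key signatures over a manifest, whereas the sketch below uses an HMAC with a shared key as a standard-library stand-in. The function names and the `device_key` are hypothetical.

```python
import hashlib
import hmac


def sign_capture(image_bytes: bytes, device_key: bytes) -> str:
    """Produce a tag bound to the exact bytes at capture time (HMAC stand-in)."""
    return hmac.new(device_key, image_bytes, hashlib.sha256).hexdigest()


def verify_capture(image_bytes: bytes, tag: str, device_key: bytes) -> bool:
    """Return True only if the bytes are unchanged since signing."""
    expected = hmac.new(device_key, image_bytes, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


device_key = b"hypothetical-device-key"  # assumption: a secret held by the capture device
original = bytes(range(16))              # stand-in for image data
tag = sign_capture(original, device_key)

# Changing even one byte invalidates the tag.
tampered = original[:-1] + b"\xff"

print(verify_capture(original, tag, device_key))
print(verify_capture(tampered, tag, device_key))
```

Unlike a watermark, the tag cannot be quietly perturbed away: any change to the bytes makes verification fail, and content that arrives with no valid tag simply has no verified origin.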

Sources

UnMarker: A Universal Attack on Defensive Image Watermarking (arXiv 2405.08363), primary source. Kassis & Hengartner (IEEE S&P 2025): universal, black-box, query-free removal via Fourier-domain spectral-amplitude disruption plus adversarial per-pixel filtering under a perceptual constraint; >50% success across 7 schemes; detection dropped to ~21-43% on SynthID, Stable Signature, Tree-Ring, StegaStamp.
University of Waterloo News: Watermarks offer no defence against deepfakes. Institutional writeup; authors conclude defensive watermarking is not a viable defence against deepfakes.
The Register: Image watermarks meet their Waterloo with UnMarker (Jul 2025). Press coverage summarising the universal, no-knowledge, no-query attack and its cross-scheme generalisation.