Watermarks in the Sand: Impossibility of Strong LLM Watermarking

Research demonstration · 07 Nov 2023

Zhang, Edelman, Francati, Venturi, Ateniese and Barak (ICML 2024) prove that strong watermarking for generative models is impossible under natural assumptions, even when the detector uses a secret key. The proof is constructive: it gives a generic removal attack that needs no knowledge of the secret key, only black-box access to the watermarked model plus two oracles that can be built from a much weaker open-source model. The first is a quality oracle, which judges whether a candidate output is still a good response. The second is a perturbation oracle, which makes quality-preserving edits that drive a random walk over outputs. The authors instantiate the attack to strip the watermarks of Kirchenbauer et al. (2023), Kuditipudi et al. (2023) and Zhao et al. (2023) with only minor quality degradation, showing that provenance watermarks on text can be evaded and are not tamper-proof.
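The shape of the removal attack can be illustrated with a minimal sketch. This is not the authors' implementation: `remove_watermark`, `toy_perturb`, `toy_quality` and the `SYNONYMS` table are hypothetical names invented for this example, and the toy oracles stand in for the language models a real attack would query. The loop only shows the core idea, a random walk that accepts a perturbed candidate when the quality oracle reports no loss of quality.

```python
import random
from typing import Callable

# Hypothetical toy vocabulary: words that can be swapped without hurting quality.
SYNONYMS = {
    "big": ["large", "huge"], "large": ["big", "huge"], "huge": ["big", "large"],
    "quick": ["fast", "rapid"], "fast": ["quick", "rapid"], "rapid": ["quick", "fast"],
}


def toy_perturb(text: str, rng: random.Random) -> str:
    """Toy perturbation oracle (assumption): edit one randomly chosen word.

    Words with a known synonym are swapped; any other word is replaced by a
    junk token, so some proposals damage quality and must be filtered out.
    """
    words = text.split()
    i = rng.randrange(len(words))
    options = SYNONYMS.get(words[i])
    words[i] = rng.choice(options) if options else "???"
    return " ".join(words)


def toy_quality(text: str) -> float:
    """Toy quality oracle (assumption): fraction of words that are not junk."""
    words = text.split()
    return 1.0 - sum(w == "???" for w in words) / len(words)


def remove_watermark(
    response: str,
    perturb: Callable[[str, random.Random], str],
    quality: Callable[[str], float],
    steps: int = 200,
    tolerance: float = 0.0,
    seed: int = 0,
) -> str:
    """Random walk over responses, keeping only quality-preserving steps."""
    rng = random.Random(seed)
    baseline = quality(response)  # quality of the original watermarked output
    current = response
    for _ in range(steps):
        candidate = perturb(current, rng)
        # Accept the step only if quality stays within tolerance of the baseline.
        if quality(candidate) >= baseline - tolerance:
            current = candidate
    return current


original = "the big dog took a quick walk"
cleaned = remove_watermark(original, toy_perturb, toy_quality)
print(cleaned)
```

Note that the loop never consults a watermark detector or a secret key. In this sketch, as in the summary above, the only signals are the two oracles, which is why the attack applies across watermarking schemes.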

Risks it illustrates

Watermark & Provenance Evasion

More cases on Watermark & Provenance Evasion

UnMarker: Universal Black-Box Attack Defeating SynthID and Stable Signature