
arXiv submission date: 2026-05-06
📄 Abstract - Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks

Statistical watermarking is a common approach for verifying whether text was written by a language model. Most existing schemes assume autoregressive generation, where tokens are produced left to right and contextual hashing is well defined. Diffusion language models instead generate text by denoising tokens in arbitrary order, so these schemes cannot be applied directly. A recent watermark by Gloaguen et al. addresses this gap for LLaDA 8B Instruct and reports true-positive detection above 99%. This paper studies what happens when watermarked text is rewritten not once but several times. Using the same watermark configuration, 1,605 watermarked completions of about 300 tokens each are produced across five WaterBench domains. Each completion is rewritten by four open-weight language models, ranging from 1.5B to 8B parameters, none of which know the watermark key. Five rewrite styles are tested: paraphrase, humanize, simplify, academic, and summarize-expand. Each style is chained for up to five hops, producing 160,500 rewritten texts in total. The watermark is detected on 87.9% of the original outputs at the standard significance threshold. After a single rewrite, detection falls to between 14% and 41%, depending on the rewriter and style. After five chained rewrites, detection falls to 4.86%, meaning 94.76% of the originally detected texts are no longer flagged. After three rewrites, the detector score has already dropped 86% of the way from its watermarked baseline toward the null distribution. Repeated rewriting is therefore a much stronger attack than a single rewrite, and the result holds across all four rewriters tested.
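The attack protocol above (generate watermarked text, apply the same rewriter repeatedly, and re-run the detector after each hop) can be sketched with a toy green-list watermark. Everything below is a simplified stand-in for illustration only: `is_green`, `toy_generate`, and the z-test detector are hypothetical, not the actual scheme of Gloaguen et al. for LLaDA.

```python
import math
import random

random.seed(0)

ALPHA = 0.01  # significance threshold for flagging a text as watermarked
VOCAB = [f"w{i}" for i in range(1000)]  # toy vocabulary

def is_green(token, key=42):
    # Toy keyed "green list": hash parity stands in for the real
    # watermark partition (the paper's scheme is more involved).
    return hash((token, key)) % 2 == 0

def watermark_pvalue(tokens, key=42):
    # One-sided z-test on the green-token count against the 50% null rate.
    n = len(tokens)
    g = sum(is_green(t, key) for t in tokens)
    z = (g - 0.5 * n) / math.sqrt(n * 0.25)
    return 0.5 * math.erfc(z / math.sqrt(2))

def toy_generate(n=300, key=42, bias=0.8):
    # Sample tokens while preferring green ones: a non-green token is
    # kept only with probability 1 - bias.
    out = []
    while len(out) < n:
        t = random.choice(VOCAB)
        if is_green(t, key) or random.random() > bias:
            out.append(t)
    return out

def rewrite(tokens, frac=0.4):
    # Toy "paraphrase" hop: replace a random fraction of tokens, which
    # dilutes the green-token excess a little more on every hop.
    return [random.choice(VOCAB) if random.random() < frac else t
            for t in tokens]

def detection_rate(texts, hops=5, alpha=ALPHA):
    # Chain `hops` rewrites and record the fraction of texts still
    # flagged after each hop.
    rates = []
    current = [list(t) for t in texts]
    for _ in range(hops):
        current = [rewrite(t) for t in current]
        flagged = sum(watermark_pvalue(t) < alpha for t in current)
        rates.append(flagged / len(current))
    return rates
```

Because each hop replaces a fixed fraction of tokens, the green-token excess decays roughly geometrically toward the null distribution, which mirrors the qualitative pattern the paper reports for chained rewrites.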

Top-level tag: llm model evaluation
Detailed tags: watermark attack, rewriting attack, diffusion language model, detection evasion, robustness

Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks


1️⃣ One-Sentence Summary

This paper finds that repeatedly rewriting watermarked text generated by a diffusion language model sharply weakens watermark detection: a single rewrite drops the detection rate from 88% to 14%-41%, and five chained rewrites leave only about 5%, showing that consecutive rewriting poses a far more serious threat than a single rewrite.

Source: arXiv:2605.05503