Optimizing Adaptive Attacks against Content Watermarks for Language Models
Diaa, Abdulrahman, Aremu, Toluwani, Lukas, Nils
– arXiv.org Artificial Intelligence
Large Language Models (LLMs) can be misused to spread online spam and misinformation. Content watermarking deters misuse by hiding a message in model-generated outputs, enabling their detection with a secret watermarking key. Robustness is a core security property: evading detection should require (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but their robustness is tested only against non-adaptive attackers, who lack knowledge of the watermarking method and can therefore find only suboptimal attacks. We formulate the robustness of LLM watermarking as an objective function and propose preference-based optimization to tune adaptive attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks substantially outperform non-adaptive baselines; (ii) even in a non-adaptive setting, attacks optimized against a few known watermarks remain highly effective when tested against other, unseen watermarks; and (iii) optimization-based attacks are practical, requiring less than seven GPU hours. Our findings underscore the need to test robustness against adaptive attackers.
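The abstract frames robustness as an objective the attacker optimizes directly, trading off watermark detectability against text quality, and uses preference-based optimization to tune the attack. Below is a minimal sketch of how such an objective and the resulting preference pairs might be assembled; the names watermark_score, quality_score, and the lambda weighting are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): cast watermark evasion as an
# objective over paraphrased outputs, then turn ranked candidates into
# preference pairs for a DPO-style trainer. watermark_score and quality_score
# are assumed attacker-supplied callables (e.g., a surrogate detector and a
# semantic-similarity model).

def attack_objective(candidate, original, watermark_score, quality_score, lam=1.0):
    """Higher is better: low watermark detectability, preserved text quality."""
    return -watermark_score(candidate) + lam * quality_score(candidate, original)

def build_preference_pairs(original, paraphrases, watermark_score, quality_score, lam=1.0):
    """Rank paraphrases by the objective and pair strong candidates (preferred)
    with weak ones (rejected), the format consumed by preference optimization."""
    ranked = sorted(
        paraphrases,
        key=lambda c: attack_objective(c, original, watermark_score, quality_score, lam),
        reverse=True,
    )
    k = len(ranked) // 2
    return list(zip(ranked[:k], ranked[-k:]))  # (preferred, rejected) tuples
```

A preference-optimization trainer could then fine-tune the paraphrasing (attack) model on these pairs; the split-in-half pairing above is a simplification chosen only to keep the sketch short.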
Oct-3-2024
- Country:
  - Asia > Middle East > UAE (0.04)
  - Europe
  - North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)
  - North America > United States > New York (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Information Technology > Security & Privacy (1.00)
- Technology: