$B^4$: A Black-Box Scrubbing Attack on LLM Watermarks
Huang, Baizhou, Pu, Xiao, Wan, Xiaojun
–arXiv.org Artificial Intelligence
Watermarking has emerged as a prominent technique for LLM-generated content detection by embedding imperceptible patterns. Despite supreme performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known. Some even necessitates knowledge about hyperparameters of the watermarking method. Such prerequisites are unattainable in real-world scenarios. Targeting at a more realistic black-box threat model with fewer assumptions, we here propose $B^4$, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy distributions. Experimental results across 12 different settings demonstrate the superior performance of $B^4$ compared with other baselines.
arXiv.org Artificial Intelligence
Nov-6-2024
- Country:
- North America > United States
- California (0.14)
- Texas (0.14)
- North America > United States
- Genre:
- Research Report (0.64)
- Industry:
- Information Technology > Security & Privacy (0.93)
- Transportation > Air (0.82)
- Technology: