$B^4$: A Black-Box Scrubbing Attack on LLM Watermarks

Nov-6-2024–arXiv.org Artificial Intelligence

Watermarking has emerged as a prominent technique for LLM-generated content detection by embedding imperceptible patterns. Despite supreme performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known. Some even necessitates knowledge about hyperparameters of the watermarking method. Such prerequisites are unattainable in real-world scenarios. Targeting at a more realistic black-box threat model with fewer assumptions, we here propose $B^4$, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy distributions. Experimental results across 12 different settings demonstrate the superior performance of $B^4$ compared with other baselines.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

Nov-6-2024

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - California (0.14)
  - Texas (0.14)

Genre:
- Research Report (0.64)

Industry:
- Information Technology > Security & Privacy (0.93)
- Transportation > Air (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.95)
  - Natural Language
    - Chatbot (1.00)
    - Large Language Model (1.00)
  - Representation & Reasoning > Optimization (1.00)