Detecting Benchmark Contamination Through Watermarking
Tom Sander, Pierre Fernandez, Saeed Mahloujifar, Alain Durmus, Chuan Guo
– arXiv.org Artificial Intelligence
Benchmark contamination poses a significant challenge to the reliability of Large Language Model (LLM) evaluations, as it is difficult to ascertain whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark's utility. During evaluation, we can detect "radioactivity", i.e. traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B-parameter models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility post-watermarking and successful contamination detection when models are contaminated enough to enhance performance, e.g. a p-value of 10^-3 for a +5% gain on ARC-Easy.
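To make the idea concrete, here is a minimal, illustrative sketch of how watermark "radioactivity" can be scored with a statistical test. It assumes a Kirchenbauer-style green-list text watermark and a simple binomial null with a normal approximation; the constants and names (`GAMMA`, `is_green`, `radioactivity_pvalue`, the key) are hypothetical and this is not the paper's exact, theoretically grounded test.

```python
import hashlib
import math
from typing import List

GAMMA = 0.25  # assumed fraction of the vocabulary placed on the "green" list per position


def is_green(prev_token: int, token: int, key: bytes = b"benchmark-wm-key") -> bool:
    """Pseudorandomly assign `token` to the green list, seeded by the previous
    token and a secret key (a Kirchenbauer-style partition; an assumption here)."""
    h = hashlib.sha256(key + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big")).digest()
    # Map the hash to [0, 1) and compare against GAMMA.
    return int.from_bytes(h[:8], "big") / 2**64 < GAMMA


def radioactivity_pvalue(token_ids: List[int]) -> float:
    """One-sided p-value that the suspect model's text contains more green tokens
    than chance: under the null (no contamination), each token is green with
    probability GAMMA, so the green count is ~ Binomial(n, GAMMA).
    Uses a normal approximation for simplicity."""
    n = len(token_ids) - 1
    greens = sum(is_green(prev, tok) for prev, tok in zip(token_ids, token_ids[1:]))
    mean, var = n * GAMMA, n * GAMMA * (1 - GAMMA)
    z = (greens - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal


if __name__ == "__main__":
    # Toy example: token ids sampled from a suspect model prompted with the watermarked benchmark.
    suspect_tokens = [101, 2054, 2003, 1996, 3007, 1997, 2605, 102]
    print(f"p-value: {radioactivity_pvalue(suspect_tokens):.3g}")
```

In practice the test would aggregate many generations from the suspect model; a small p-value (e.g. 10^-3, as in the abstract) indicates the model has absorbed the watermark signal and was therefore likely trained on the benchmark.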
Feb-24-2025
- Genre:
- Research Report > New Finding (0.34)
- Industry:
- Information Technology > Security & Privacy (1.00)