A Watermark for Black-Box Language Models

Dara Bahri, John Wieting, Dana Alon, Donald Metzler

arXiv.org Artificial Intelligence 

Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not available to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e., black-box access). We provide performance guarantees, demonstrate how the scheme can be leveraged when white-box access is available, and show via comprehensive experiments when it can outperform existing white-box schemes.

It can be critical to understand whether a piece of text was generated by a large language model (LLM). For instance, one often wants to know how trustworthy a piece of text is, and text written by an LLM may be deemed untrustworthy because these models can hallucinate. This problem comes in different flavors: one may want to detect whether the text was generated by a specific model or by any model. Furthermore, the detecting party may or may not have white-box access (e.g., the ability to compute log-probabilities) to the generator they wish to test against. Parties with white-box access are typically the owners of the model, so we refer to this case as first-party detection and its counterpart as third-party detection.

The goal of watermarking is to cleverly bias the generator so that first-party detection becomes easier. Most proposed techniques do not modify the underlying LLM's weights or its training procedure; instead, they inject the watermark during autoregressive decoding at inference time. They require access to the next-token logits and inject the watermark at every step of the sampling loop. This required access prevents third-party users of an LLM from applying their own watermark, since proprietary APIs currently do not expose this option, and supporting such functionality would pose a security risk in addition to significant engineering challenges.
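To make concrete the kind of logit-level access these white-box schemes depend on, the following is a minimal sketch in the spirit of the green-list approach of Kirchenbauer et al. (2023). The function names, the hashing scheme, and the hyperparameters gamma and delta are illustrative assumptions, not the method proposed in this paper.

```python
import hashlib

import torch


def greenlist_bias(logits: torch.Tensor, prev_token: int,
                   key: str = "secret-key", gamma: float = 0.5,
                   delta: float = 2.0) -> torch.Tensor:
    # Seed a PRNG from (secret key, previous token) so that a detector
    # holding the key can recompute the same green list at every position.
    seed = int.from_bytes(
        hashlib.sha256(f"{key}:{prev_token}".encode()).digest()[:8], "big")
    gen = torch.Generator().manual_seed(seed)
    # Mark a gamma-fraction of the vocabulary as "green" and add a
    # constant bias delta to those logits before sampling.
    green = torch.rand(logits.shape[-1], generator=gen) < gamma
    return logits + delta * green.float()


def watermarked_sample(next_token_logits, prompt_ids: list[int],
                       max_new_tokens: int = 50) -> list[int]:
    # `next_token_logits` stands in for white-box access: a callable that
    # returns the model's logit vector over the vocabulary given a prefix.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = greenlist_bias(next_token_logits(ids), ids[-1])
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        ids.append(next_id)
    return ids
```

A detector holding the secret key can re-derive each position's green list and test whether green tokens are over-represented in a suspect text. The point of the sketch is simply that every step of the loop touches the raw logits, which a sampling-only (black-box) API does not expose.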