Data Mixture Inference Attack: BPE Tokenizers Reveal Training Data Compositions

Neural Information Processing Systems 

The pretraining data of today's strongest language models remains opaque, even when their parameters are open-sourced.In particular, little is known about the proportions of different domains, languages, or code represented in the data. While a long line of membership inference attacks aim to identify training examples on an instance level, they do not extend easily to statistics about the corpus. In this work, we tackle a task which we call, which aims to uncover the distributional make-up of the pretraining data. We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered vocabulary learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first token is the most common byte pair, the second is the most common pair after merging the first token, and so on.