Neural Information Processing Systems
The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data. Given a tokenizer's merge list along with example data for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs.
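As a rough illustration of the linear program the abstract describes, the sketch below simulates BPE training on small per-category token corpora and solves for mixture weights with scipy. The intuition: at each step t, the merge m_t recorded in the tokenizer's merge list must have been the most frequent pair under the true mixture, which yields one linear constraint per competing pair; slack variables absorb sampling noise. The helper names (`pair_counts`, `apply_merge`, `solve_mixture`) and the exact slack formulation are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of a merge-list mixture-inference LP (assumed formulation,
# not the authors' implementation). Requires numpy and scipy.
from collections import Counter
from scipy.optimize import linprog

def pair_counts(seqs):
    """Count adjacent token pairs across a list of token sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts

def apply_merge(seqs, pair):
    """Greedily replace every occurrence of `pair` with its merged token."""
    merged = pair[0] + pair[1]
    out = []
    for seq in seqs:
        new, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                new.append(merged)
                i += 2
            else:
                new.append(seq[i])
                i += 1
        out.append(new)
    return out

def solve_mixture(merge_list, corpora):
    """Estimate mixture weights alpha from an ordered BPE merge list.

    For each step t and each competing pair p, require
        sum_i alpha_i * (c_i(m_t) - c_i(p)) + s_t >= 0,  s_t >= 0,
    where c_i are pair counts in category i's sample after applying the
    first t-1 merges. Minimize total slack s.t. alpha >= 0, sum(alpha) = 1.
    """
    k, T = len(corpora), len(merge_list)
    rows = []
    corpora = [list(c) for c in corpora]
    for t, merge in enumerate(merge_list):
        counts = [pair_counts(c) for c in corpora]
        competitors = set().union(*counts) - {merge}
        for p in competitors:
            d = [c[merge] - c[p] for c in counts]
            row = [-di for di in d] + [0.0] * T  # -sum_i alpha_i d_i - s_t <= 0
            row[k + t] = -1.0                    # slack for step t
            rows.append(row)
        corpora = [apply_merge(c, merge) for c in corpora]
    c = [0.0] * k + [1.0] * T                    # objective: total slack
    A_eq = [[1.0] * k + [0.0] * T]               # weights sum to 1
    res = linprog(c, A_ub=rows, b_ub=[0.0] * len(rows),
                  A_eq=A_eq, b_eq=[1.0], bounds=(0, None))
    return res.x[:k]
```

For clarity this sketch keeps one constraint per competing pair at every step, which grows quickly with vocabulary size; a practical implementation would subsample competitor pairs or cap the number of merge steps considered.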