Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
Yukun Huang, Sanxing Chen, Jian Pei, Manzil Zaheer, Bhuwan Dhingra
arXiv.org Artificial Intelligence
Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restates each fact in diverse, compositional forms and (ii) enforces bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B and Qwen-2.5-3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation-precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
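The bidirectional binding described above can be sketched as a simple data-augmentation step. The function below is a hypothetical illustration (the names `make_active_indexing_examples`, `doc_id`, and the prompt templates are assumptions, not the paper's actual implementation): for each fact in a document, it emits one source-to-fact pair (identifier conditions generation of the content) and one fact-to-source pair (content elicits the identifier).

```python
# Hypothetical sketch of Active Indexing augmentation: build
# bidirectional (source-to-fact and fact-to-source) training pairs
# that bind each fact to a persistent document identifier.

def make_active_indexing_examples(doc_id, facts):
    """Create bidirectional training pairs binding facts to doc_id."""
    examples = []
    for fact in facts:
        # source-to-fact: given the identifier, generate the content
        examples.append({
            "prompt": f"According to document [{doc_id}], state a fact:",
            "target": fact,
        })
        # fact-to-source: given the content, attribute the identifier
        examples.append({
            "prompt": f'Which document supports: "{fact}"?',
            "target": f"[{doc_id}]",
        })
    return examples

pairs = make_active_indexing_examples(
    "wiki_12345",
    ["The Eiffel Tower was completed in 1889."],
)
```

In the paper's full recipe, the facts themselves would also be restated in diverse, compositional paraphrases before pairing, so the binding generalizes beyond the document's surface wording.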
Oct-30-2025