Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Mahajan, Divyat, Goyal, Sachin, Idrissi, Badr Youbi, Pezeshki, Mohammad, Mitliagkas, Ioannis, Lopez-Paz, David, Ahuja, Kartik

Oct-17-2025–arXiv.org Artificial Intelligence

Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

Oct-17-2025

arXiv.org PDF

Add feedback

Country:
- North America > Canada (0.28)

Genre:
- Research Report > New Finding (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found