On multi-token prediction for efficient LLM inference

Mehra, Somesh, Garcia, Javier Alonso, Mauch, Lukas

Feb-13-2025–arXiv.org Artificial Intelligence

We systematically investigate multi-token prediction (MTP) capabilities within LLMs pre-trained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. Furthermore, we explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, prompting further research in this direction. Our findings provide a deeper understanding of MTP applied to pretrained LLMs, informing strategies for accelerating inference through parallel token prediction. In recent years, decoder-only transformers have emerged as the state-of-the-art models for language modeling and are widely adopted for large language models (LLMs).

backbone, prediction, token probability, (16 more...)

arXiv.org Artificial Intelligence

Feb-13-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.05)
- Europe > Switzerland
  - Vaud > Lausanne (0.04)
- Asia > Japan
  - Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Genre:
- Research Report > Promising Solution (0.49)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found