OmniDraft: A cross-vocabulary, online adaptive drafter for on-device speculative decoding

Jun-16-2026, 14:51:21 GMT–Neural Information Processing Systems

Speculative decoding generally requires a small, efficient draft model that is either pretrained or distilled offline for a particular target model series, for instance, Llama or Qwen models. In online deployment settings, however, there are two major challenges: 1) the use of a target model that is incompatible with the draft model; 2) the expectation that latency improves with usage over time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch between draft and target models, and we further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications, where model cost, efficiency and user customization are the major concerns. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm.
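
To make the cross-vocabulary idea concrete, here is a minimal sketch of an n-gram cache that merges runs of draft-tokenizer pieces into a single target-vocabulary token. It is an illustration under stated assumptions, not the authors' implementation: the class name `NGramCache`, its methods, the `max_n` parameter, and the toy token strings are all hypothetical, and the sketch works on token strings only, ignoring the probabilities and distillation fine-tuning that the abstract mentions.

```python
from typing import Dict, List, Tuple


class NGramCache:
    """Toy cache mapping runs of draft tokens to one target-vocabulary token.

    Hypothetical illustration only; not the OmniDraft implementation.
    """

    def __init__(self, max_n: int = 3) -> None:
        self.max_n = max_n
        self.table: Dict[Tuple[str, ...], str] = {}

    def add(self, draft_tokens: List[str], target_token: str) -> None:
        # Record a merge only if the draft pieces spell the target token exactly.
        fits = 1 < len(draft_tokens) <= self.max_n
        if fits and "".join(draft_tokens) == target_token:
            self.table[tuple(draft_tokens)] = target_token

    def translate(self, draft_tokens: List[str]) -> List[str]:
        # Greedy longest match: merge cached n-grams, pass other tokens through.
        out: List[str] = []
        i = 0
        while i < len(draft_tokens):
            for n in range(min(self.max_n, len(draft_tokens) - i), 1, -1):
                key = tuple(draft_tokens[i:i + n])
                if key in self.table:
                    out.append(self.table[key])
                    i += n
                    break
            else:
                # No cached n-gram starts here; keep the single draft token.
                out.append(draft_tokens[i])
                i += 1
        return out


cache = NGramCache()
# Assume the target tokenizer has one token where the draft tokenizer has two.
cache.add(["Omni", "Draft"], "OmniDraft")
proposal = cache.translate(["Use", " ", "Omni", "Draft", " now"])
print(proposal)
```

In this toy setting the cache is filled online: each time the target model emits a token that the draft vocabulary can only spell as several pieces, the pair is recorded, so later draft proposals can be expressed in the target's vocabulary before verification.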

large language model, machine learning, target model

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.93)

Industry:
- Information Technology > Security & Privacy (0.48)
- Education > Educational Setting
  - Online (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.94)
  - Natural Language
    - Large Language Model (0.69)
    - Machine Translation (0.67)
