HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models

Xie, Zhinan, Wang, Peisong, Qiu, Shuang, Cheng, Jian

Nov-21-2025–arXiv.org Artificial Intelligence

Speculative decoding has proven effective for accelerating inference in Large Language Models (LLMs), yet its extension to Vision-Language Models (VLMs) remains limited by the computational burden and semantic inconsistency introduced by visual tokens. Recent studies reveal that visual tokens in large VLMs are highly redundant, and most of them can be removed without compromising generation quality. Motivated by this observation, we propose HiViS (Hiding Visual T okens from the Drafter for Speculative Decoding in Vision-Language Models), a framework that utilizes the target VLM as a semantic fusion model, allowing the drafter to obtain visual information without explicitly processing visual tokens, ensuring that the drafter's prefill sequence length matches that of the textual tokens. Furthermore, HiViS employs a time-step-aware aligned training scheme that allows the drafter to autonomously propagate and refine instructive visual-textual semantics during independent drafting, guided by step-dependent bias-correction residuals. Extensive experiments across representative VLMs and benchmarks demonstrate that HiViS achieves significant improvements in average acceptance length and speedup ratio.

drafter, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

Nov-21-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found