VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Liu, Ziyan, Chen, Yeqiu, Cai, Hongyi, Lin, Tao, Yang, Shuo, Liu, Zheng, Zhao, Bo
–arXiv.org Artificial Intelligence
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. T oken pruning--keeping salient visual tokens and dropping redundant ones--has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. T o bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token prune method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
arXiv.org Artificial Intelligence
Nov-24-2025
- Country:
- Asia
- China
- Guangdong Province > Shenzhen (0.04)
- Heilongjiang Province > Harbin (0.04)
- Shanghai > Shanghai (0.04)
- Middle East > Iraq
- Kurdistan Region (0.04)
- China
- North America > Trinidad and Tobago
- Asia
- Genre:
- Research Report > Promising Solution (0.34)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Representation & Reasoning (1.00)
- Robots > Manipulation (0.34)
- Vision (1.00)
- Information Technology > Artificial Intelligence