FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization

Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, Liangtao Zheng, Tao Jiang, Jingjing Gong, Xipeng Qiu, Hang Zhao


Figure 1: FASTer combines a learnable action tokenizer (FASTerVQ) and an autoregressive VLA model (FASTerVLA), achieving efficient compression, fast control, and strong performance across eight real and simulated embodiments.

Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable action tokenizer (FASTerVQ) with an autoregressive policy built upon it (FASTerVLA). FASTerVLA pairs this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.

Vision-Language-Action (VLA) models represent a paradigm shift in robotics, embodying generalist robot policies trained on increasingly large-scale robotic datasets (Chenjia Bai, 2024). These models are categorized primarily by their method of robot action prediction, with the most prominent approaches being diffusion-based models (Team et al., 2024; Black et al., 2024) and autoregressive models (Belkhale & Sadigh, 2024; Kim et al., 2024; Pertsch et al., 2025; Zhou et al., 2025). While diffusion-based models have demonstrated superior precision in manipulation tasks, they often exhibit a notable deficiency in leveraging critical visual and linguistic cues (Pertsch et al., 2025; Dong et al., 2025). In contrast, recent research indicates that a carefully designed autoregressive VLA model can increasingly bridge the performance gap with its diffusion-based counterparts, while simultaneously offering enhanced instruction-following capabilities (Pertsch et al., 2025; Intelligence et al., 2025; Hancock et al., 2025), superior scene generalization (Pertsch et al., 2025), and effective transfer of common-sense knowledge (Brohan et al., 2023). Most importantly, autoregressive VLA models share the greatest architectural similarity with the highly successful Vision-Language Models (VLMs), suggesting significant potential for future advancements.

A pivotal challenge within autoregressive VLA models is the development of an appropriate tokenization scheme to discretize continuous robot action sequences into action tokens (Wang et al., 2025c; Pertsch et al., 2025). Numerous sequence-modeling studies, spanning LLMs and Speech-LLMs, have demonstrated that tokenizer quality directly determines model performance (Radford et al., 2019; Zhang et al., 2023; Gong et al., 2025).
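To make the fidelity-versus-efficiency trade-off concrete, the sketch below shows a minimal vector-quantization-style action tokenizer: a chunk of continuous actions is flattened and matched to the nearest vector in a codebook, so a long continuous trajectory becomes a short sequence of discrete token ids. This is an illustration of the general technique only, not the paper's FASTerVQ architecture; the class name, codebook size, and chunk length are hypothetical.

```python
import numpy as np

class ActionVQTokenizer:
    """Minimal vector-quantization action tokenizer (illustrative sketch).

    Maps a chunk of continuous actions to discrete token ids via
    nearest-neighbor lookup in a codebook, and back again. A learned
    tokenizer would train the codebook end to end instead.
    """

    def __init__(self, codebook: np.ndarray, chunk: int):
        # codebook: (K, chunk * action_dim); each row is one code vector
        self.codebook = codebook
        self.chunk = chunk

    def encode(self, actions: np.ndarray) -> np.ndarray:
        # actions: (T, action_dim), with T divisible by chunk
        T, d = actions.shape
        segs = actions.reshape(T // self.chunk, self.chunk * d)
        # squared L2 distance from every segment to every code vector
        dists = ((segs[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)            # (T // chunk,) token ids

    def decode(self, tokens: np.ndarray, action_dim: int) -> np.ndarray:
        # reconstruct the action chunk by codebook lookup
        return self.codebook[tokens].reshape(-1, action_dim)

# Toy usage: 7-DoF actions, chunks of 4 steps, a 256-entry codebook.
rng = np.random.default_rng(0)
tok = ActionVQTokenizer(rng.normal(size=(256, 4 * 7)), chunk=4)
acts = rng.normal(size=(16, 7))                # 16 timesteps of actions
ids = tok.encode(acts)                         # only 4 discrete tokens
recon = tok.decode(ids, action_dim=7)
print(ids.shape, recon.shape)                  # (4,) (16, 7)
```

Shorter token sequences mean fewer autoregressive decoding steps, but a coarser codebook reconstructs actions less faithfully, which is exactly the trade-off the paper targets.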
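The abstract also credits FASTerVLA's speed to block-wise autoregressive decoding. The loop below sketches the generic idea under stated assumptions: rather than one forward pass per token, the model emits a block of B tokens per pass, cutting the number of sequential passes by a factor of B. The `predict_block` interface and the stand-in model are hypothetical, not the paper's implementation.

```python
import numpy as np

def blockwise_decode(predict_block, prompt, n_tokens: int, block: int):
    """Sketch of block-wise autoregressive decoding (illustrative only).

    `predict_block(seq)` stands in for one forward pass of the policy
    and returns `block` token ids. Token-by-token decoding would need
    n_tokens passes; this loop needs only n_tokens / block.
    """
    seq = list(prompt)
    for _ in range(0, n_tokens, block):
        new = predict_block(np.asarray(seq))   # one pass yields a block
        seq.extend(int(t) for t in new[:block])
    return seq[len(prompt):]

# Toy stand-in model: emits `block` pseudo-random token ids per call.
rng = np.random.default_rng(0)
fake_model = lambda seq: rng.integers(0, 256, size=4)
tokens = blockwise_decode(fake_model, prompt=[1, 2, 3], n_tokens=16, block=4)
print(len(tokens))  # 16 action tokens from 4 forward passes, not 16
```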
