ACG: Action Coherence Guidance for Flow-based VLA models

Minho Park, Kinam Kim, Junha Hyung, Hyojin Jang, Hoiyeong Jin, Jooyeol Yun, Hojoon Lee, Jaegul Choo

arXiv.org Artificial Intelligence 

Abstract-- Diffusion and flow matching models have emerged as powerful robot policies, enabling Vision-Language-Action (VLA) models to generalize across diverse scenes and instructions. Yet, when trained via imitation learning, their high generative capacity makes them sensitive to noise in human demonstrations, such as jerks, pauses, and jitter, which reduces action coherence. Reduced action coherence causes instability and trajectory drift during deployment; such failures are catastrophic in fine-grained manipulation, where precision is crucial. In this paper, we present Action Coherence Guidance (ACG) for VLA models, a training-free test-time guidance algorithm that improves action coherence and thereby yields performance gains. Evaluated on RoboCasa, DexMimicGen, and real-world SO-101 tasks, ACG consistently improves action coherence and boosts success rates across diverse manipulation tasks.

Diffusion and flow matching models are reshaping how robots learn to manipulate objects [1]. These generative models act as robot policies that directly model complex action distributions from human demonstrations, enabling strong generalization across diverse manipulation tasks.
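The abstract does not spell out ACG's update rule, but as general background, test-time guidance for a flow-based policy typically blends two velocity predictions at each integration step and then integrates the guided field to produce an action chunk, in the style of classifier-free guidance. The sketch below is a generic illustration of that pattern, not the paper's method; the function names, the guidance weight `w`, and the Euler integrator are all assumptions for exposition.

```python
import numpy as np

def guided_velocity(v_guided_src, v_base, w):
    """CFG-style extrapolation between two velocity predictions.

    v = v_base + w * (v_guided_src - v_base); w = 1 recovers
    v_guided_src, w = 0 recovers v_base, w > 1 extrapolates.
    (Generic guidance pattern, not ACG's specific rule.)
    """
    return v_base + w * (v_guided_src - v_base)

def integrate_flow(x0, velocity_fn, steps=10):
    """Euler integration of dx/dt = v(x, t) from t=0 to t=1,
    mapping noise x0 to an action sample."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x
```

Because the blend happens inside the sampling loop, a scheme like this stays training-free: only the denoising trajectory at deployment changes, while the learned policy weights are untouched.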