FALCON: Actively Decoupled Visuomotor Policies for Loco-Manipulation with Foundation-Model-Based Coordination

He, Chengyang, Sun, Ge, Bai, Yue, Lu, Junkai, Zhao, Jiadong, Sartoretti, Guillaume

arXiv.org Artificial Intelligence 

FALCON actively decouples locomotion and manipulation through two modular diffusion policies, coordinated by a vision-language foundation model. The VLM encodes global scene context, proprioceptive states, and goal instructions into a shared latent embedding that conditions both subsystems.

Abstract--We present FoundAtion-model-guided decoupled LoCO-maNipulation visuomotor policies (FALCON), a framework for loco-manipulation that combines modular diffusion policies with a vision-language foundation model acting as the coordinator. Our approach explicitly decouples locomotion and manipulation into two specialized visuomotor policies, allowing each subsystem to rely on its own observations. This mitigates the performance degradation that arises when a single policy is forced to fuse heterogeneous, potentially mismatched observations from locomotion and manipulation. Our key innovation lies in restoring coordination between these two independent policies through a vision-language foundation model, which encodes global observations and language instructions into a shared latent embedding conditioning both diffusion policies. On top of this backbone, we introduce a phase-progress head that uses textual descriptions of task stages to infer discrete phase and continuous progress estimates without manual phase labels. To further structure the latent space, we incorporate a coordination-aware contrastive loss that explicitly encodes cross-subsystem compatibility between arm and base actions. Results show that FALCON surpasses centralized and decentralized baselines while exhibiting improved robustness and generalization to out-of-distribution scenarios.

Recent progress in robot learning and foundation models has rekindled the longstanding vision of general-purpose robots that can move through unstructured environments and manipulate diverse objects with minimal task-specific engineering.
Large Behavior Models (LBMs) extend the diffusion policy paradigm to multi-task dexterous manipulation [1], training a single policy across broad datasets of real and simulated trajectories. Systems such as Robotics' Memo platform [8] demonstrate impressive whole-body behaviors that combine locomotion, manipulation, and language grounding in increasingly realistic environments. These developments suggest a future where robot generalist models consume raw sensor streams and language instructions and directly output actions to interact with the physical world. However, loco-manipulation, i.e., jointly controlling a mobile base and one or more arms, remains especially challenging on legged platforms [9]-[11], where the same body must simultaneously maintain stability and accomplish precise manipulation under different sensor streams and poses. In this work, we focus on a specific yet representative setting in which an arm-mounted quadruped robot performs long-horizon loco-manipulation tasks using only RGB observations, proprioceptive states, and sparse language instructions.
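The coordination-aware contrastive loss described in the abstract can be illustrated with a minimal sketch. The paper does not give the exact formulation, so the InfoNCE-style objective below, the function name, and the temperature value are assumptions: arm and base action embeddings from the same timestep are treated as positive pairs, and mismatched pairs within the batch serve as negatives.

```python
import numpy as np

def coordination_contrastive_loss(arm_emb, base_emb, tau=0.1):
    """Hypothetical InfoNCE-style loss pulling together compatible
    arm/base embeddings (matched rows) and pushing apart mismatched ones.

    arm_emb, base_emb: (N, D) arrays; row i of each comes from the
    same timestep and is treated as a positive pair.
    """
    # L2-normalize so the dot product is cosine similarity
    a = arm_emb / np.linalg.norm(arm_emb, axis=1, keepdims=True)
    b = base_emb / np.linalg.norm(base_emb, axis=1, keepdims=True)
    logits = a @ b.T / tau                              # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the matched (diagonal) pairs
    return -np.mean(np.diag(log_probs))
```

With perfectly aligned embeddings the loss approaches zero, while shuffling the base embeddings against the arm embeddings drives it up, which is the compatibility signal the abstract attributes to this term.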
