FALCON: Actively Decoupled Visuomotor Policies for Loco-Manipulation with Foundation-Model-Based Coordination
He, Chengyang, Sun, Ge, Bai, Yue, Lu, Junkai, Zhao, Jiadong, Sartoretti, Guillaume
–arXiv.org Artificial Intelligence
FALCON actively decouples locomotion and manipulation through two modular diffusion policies, coordinated by a vision-language foundation model. The VLM encodes global scene context, proprioceptive states, and goal instructions into a shared latent embedding that conditions both subsystems.

Abstract--We present FoundAtion-model-guided decoupled LoCO-maNipulation visuomotor policies (FALCON), a framework for loco-manipulation that combines modular diffusion policies with a vision-language foundation model as the coordinator. Our approach explicitly decouples locomotion and manipulation into two specialized visuomotor policies, allowing each subsystem to rely on its own observations. This mitigates the performance degradation that arises when a single policy is forced to fuse heterogeneous, potentially mismatched observations from locomotion and manipulation. Our key innovation lies in restoring coordination between these two independent policies through a vision-language foundation model, which encodes global observations and language instructions into a shared latent embedding conditioning both diffusion policies. On top of this backbone, we introduce a phase-progress head that uses textual descriptions of task stages to infer discrete phase and continuous progress estimates without manual phase labels. To further structure the latent space, we incorporate a coordination-aware contrastive loss that explicitly encodes cross-subsystem compatibility between arm and base actions. Results show that FALCON surpasses centralized and decentralized baselines while exhibiting improved robustness and generalization to out-of-distribution scenarios.

Recent progress in robot learning and foundation models has rekindled the longstanding vision of general-purpose robots that can move through unstructured environments and manipulate diverse objects with minimal task-specific engineering.
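The conditioning scheme the abstract describes can be illustrated with a minimal numpy sketch. This is not the paper's implementation: all dimensions, parameter shapes, and the linear stand-ins for the VLM encoder, the two diffusion policies, and the phase-progress head are hypothetical; the point is only the data flow, in which each subsystem consumes its own observations plus one shared latent, and the phase-progress head reads the shared latent alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper).
D_LATENT = 32     # shared latent embedding from the VLM coordinator
D_OBS_ARM = 16    # arm-local observation features
D_OBS_BASE = 12   # base-local observation features
D_ACT_ARM = 7     # arm action dimension
D_ACT_BASE = 3    # base command dimension
N_PHASES = 4      # discrete task phases

def linear(x, W, b):
    return x @ W + b

def make_params(d_in, d_out):
    return rng.normal(scale=0.1, size=(d_in, d_out)), np.zeros(d_out)

W_arm, b_arm = make_params(D_OBS_ARM + D_LATENT, D_ACT_ARM)
W_base, b_base = make_params(D_OBS_BASE + D_LATENT, D_ACT_BASE)
W_phase, b_phase = make_params(D_LATENT, N_PHASES)
W_prog, b_prog = make_params(D_LATENT, 1)

# z stands in for the VLM's encoding of global scene context,
# proprioceptive state, and the language instruction.
z = rng.normal(size=D_LATENT)

# Each subsystem keeps its own observation stream; z is the only
# cross-subsystem coordination signal.
obs_arm = rng.normal(size=D_OBS_ARM)
obs_base = rng.normal(size=D_OBS_BASE)

# Stand-ins for the two decoupled visuomotor policies, each
# conditioned on its own observations plus the shared latent.
a_arm = linear(np.concatenate([obs_arm, z]), W_arm, b_arm)
a_base = linear(np.concatenate([obs_base, z]), W_base, b_base)

# Phase-progress head: a softmax over discrete phases and a
# sigmoid-squashed continuous progress estimate in (0, 1).
logits = linear(z, W_phase, b_phase)
phase_probs = np.exp(logits - logits.max())
phase_probs /= phase_probs.sum()
progress = 1.0 / (1.0 + np.exp(-linear(z, W_prog, b_prog)[0]))
```

In the actual system the two policy heads would be diffusion policies and the encoder a pretrained VLM; the sketch only fixes the interface between them.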
Large Behavior Models (LBMs) extend the diffusion policy paradigm to multi-task dexterous manipulation [1], training a single policy across broad datasets of real and simulated trajectories. Robotics' Memo platform [8] demonstrates impressive whole-body behaviors that combine locomotion, manipulation, and language grounding in increasingly realistic environments. These developments suggest a future in which robot generalist models consume raw sensor streams and language instructions and directly output actions to interact with the physical world. However, loco-manipulation, jointly controlling a mobile base and one or more arms, remains especially challenging on legged platforms [9]-[11], where the same body must simultaneously maintain stability and accomplish precise manipulation under different sensor streams and poses. In this work, we focus on a specific yet representative setting in which an arm-mounted quadruped robot performs long-horizon loco-manipulation tasks using only RGB observations, proprioceptive states, and sparse language instructions.
Dec-5-2025
- Genre:
- Research Report > New Finding (0.66)
- Technology:
- Information Technology > Artificial Intelligence
- Natural Language > Large Language Model (0.46)
- Robots
- Locomotion (0.66)
- Manipulation (0.48)
- Robot Planning & Action (0.46)