Visual-Geometry Diffusion Policy: Robust Generalization via Complementarity-Aware Multimodal Fusion

Tang, Yikai; Geng, Haoran; Zang, Sheng; Abbeel, Pieter; Malik, Jitendra

arXiv.org Artificial Intelligence 

Visual-Geometry Diffusion Policy (VGDP) is an imitation learning method that fuses 3D observations with 2D images through a Complementarity-Aware Fusion Module, which uses modality-wise dropout to enforce balanced use of RGB and geometry. This design yields substantial improvements in average performance, generalization, and robustness. VGDP is extensively evaluated in both simulation and the real world, covering a wide range of tasks and both visual and spatial randomizations.

Abstract: Imitation learning has emerged as a crucial approach for acquiring visuomotor skills from demonstrations, where designing effective observation encoders is essential for policy generalization. However, existing methods often struggle to generalize under spatial and visual randomizations, instead tending to overfit. To address this challenge, we propose Visual-Geometry Diffusion Policy (VGDP), a multimodal imitation learning framework built around a Complementarity-Aware Fusion Module where modality-wise dropout enforces balanced use of RGB and point-cloud cues, with cross-attention serving as a lightweight interaction layer. Our experiments show that the expressiveness of the fused latent space is largely induced by the enforced complementarity from modality-wise dropout, with cross-attention serving primarily as a lightweight interaction mechanism rather than the main source of robustness. Across a benchmark of 18 simulated tasks and 4 real-world tasks, VGDP outperforms seven baseline policies with an average performance improvement of 39.1%.
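As an illustration of the general idea behind modality-wise dropout, the following minimal sketch zeroes out at most one modality's features per training sample, so that a downstream policy cannot rely on a single modality alone. This is not the authors' implementation: the function name `modality_dropout`, the drop probability `p_drop`, the choice to zero features rather than mask them, and the rule of never dropping both modalities at once are all assumptions made for illustration. The cross-attention interaction layer is omitted.

```python
import random


def modality_dropout(rgb_feat, geo_feat, p_drop=0.3, rng=random):
    """Randomly zero out at most one modality (hypothetical sketch).

    rgb_feat, geo_feat: lists of floats standing in for per-modality
        embeddings (assumed representation, not from the paper).
    p_drop: total probability of dropping a modality, split evenly
        between RGB and geometry (assumed value).
    rng: any object with a .random() method returning a float in [0, 1).
    """
    u = rng.random()
    if u < p_drop / 2:
        # Drop RGB so the policy must rely on geometry for this sample.
        rgb_feat = [0.0] * len(rgb_feat)
    elif u < p_drop:
        # Drop geometry so the policy must rely on RGB for this sample.
        geo_feat = [0.0] * len(geo_feat)
    # Otherwise both modalities pass through unchanged.
    return rgb_feat, geo_feat


if __name__ == "__main__":
    rng = random.Random(0)
    rgb, geo = modality_dropout([0.2, 0.7], [1.5, -0.4], p_drop=0.5, rng=rng)
    print(rgb, geo)
```

Under these assumptions the dropout would be applied only during training; at evaluation time both feature sets would be passed through unchanged (equivalent to `p_drop=0.0`).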
