

Scaffolding Dexterous Manipulation with Vision-Language Models

Neural Information Processing Systems

Dexterous robotic hands are essential for performing complex manipulation tasks, yet remain difficult to train due to the challenges of demonstration collection and high-dimensional control. While reinforcement learning (RL) can alleviate the data bottleneck by generating experience in simulation, it typically relies on carefully designed, task-specific reward functions, which hinder scalability and generalization. Thus, contemporary works in dexterous manipulation have often bootstrapped from reference trajectories. These trajectories specify target hand poses that guide the exploration of RL policies and object poses that enable dense, task-agnostic rewards. However, sourcing suitable trajectories, particularly for dexterous hands, remains a significant challenge. At the same time, the precise details in explicit reference trajectories are often unnecessary, as RL ultimately refines the motion.




PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation

Neural Information Processing Systems

Language-guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish various complex manipulation tasks. Previous work generally maps instructions and visual perceptions directly to low-level executable actions, neglecting the modeling of critical waypoints (e.g., key states of "close to/grab/move up" in action trajectories) in manipulation tasks. To address this issue, we propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R) that focuses solely on the prediction of task-relevant waypoints. Specifically, PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive-driven waypoint prediction, while the latter focuses on decoding low-level actions. We also design an asynchronous hierarchical executor (AHE) for PIVOT-R, which can use different execution frequencies for different modules of the model, thereby reducing computational redundancy and improving execution efficiency. PIVOT-R outperforms state-of-the-art (SoTA) open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT-R, the execution efficiency of PIVOT-R with AHE is increased 28-fold, with only a 2.9% drop in performance. These results provide compelling evidence that PIVOT-R can significantly improve both the performance and efficiency of robotic manipulation.
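
The idea of running different modules at different execution frequencies can be illustrated with a minimal sketch. It is not the PIVOT-R implementation: the function name `run_async_hierarchy`, the three callables, and the two period parameters are all hypothetical, and the loop is a single-threaded approximation in which slower modules simply reuse their last output between updates.

```python
def run_async_hierarchy(observations, parse_primitive, predict_waypoint,
                        decode_action, primitive_period=10, waypoint_period=3):
    """Call each module on its own schedule and reuse cached outputs in between.

    All names and periods here are illustrative assumptions, not values
    taken from the paper.
    """
    primitive = None
    waypoint = None
    calls = {"primitive": 0, "waypoint": 0, "action": 0}
    actions = []
    for step, obs in enumerate(observations):
        # Slowest module: re-parse the primitive action only occasionally.
        if step % primitive_period == 0:
            primitive = parse_primitive(obs)
            calls["primitive"] += 1
        # Medium-rate module: refresh the predicted waypoint.
        if step % waypoint_period == 0:
            waypoint = predict_waypoint(obs, primitive)
            calls["waypoint"] += 1
        # Fastest module: decode a low-level action at every control step.
        actions.append(decode_action(obs, waypoint))
        calls["action"] += 1
    return actions, calls


if __name__ == "__main__":
    actions, calls = run_async_hierarchy(
        list(range(12)),
        parse_primitive=lambda obs: "grab",
        predict_waypoint=lambda obs, prim: obs,
        decode_action=lambda obs, wp: (obs, wp),
    )
    print(calls)
```

With these placeholder periods, the expensive modules are invoked far less often than the action decoder, which is the kind of redundancy reduction the abstract attributes to AHE.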




Appendix for "Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation"

Neural Information Processing Systems

In our experiments, the fine-grained map, global semantic map, and multi-granularity map are of different sizes (as shown in Figure A) to save GPU memory. Object categories predicted by hallucination module. We use an Adam optimizer with a learning rate of 2.5e-4. Specifically, we consider the 10% area with the highest probability in the 2D distributions P and P̂ (as described in Section 3.3) as the ground-truth and predicted locations. From Table 1, this variant performs worse than our agent.
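
Selecting the highest-probability 10% area of a 2D distribution can be sketched as follows. This is only one plausible reading of the description: it assumes "area" means the top 10% of grid cells by probability (rather than cells covering 10% of the probability mass), and the function name `top_fraction_mask` is hypothetical.

```python
import math


def top_fraction_mask(prob_map, fraction=0.10):
    """Return a boolean mask marking the highest-probability cells.

    Assumption: the top `fraction` of cells (by count) is kept. Cells tied
    with the threshold value are all kept, so the mask may be slightly larger.
    """
    flat = [p for row in prob_map for p in row]
    k = max(1, math.ceil(fraction * len(flat)))  # number of cells to keep
    threshold = sorted(flat, reverse=True)[k - 1]  # k-th largest probability
    return [[p >= threshold for p in row] for row in prob_map]


if __name__ == "__main__":
    # Toy 2 x 5 map with distinct values.
    toy = [
        [0.01, 0.02, 0.03, 0.04, 0.05],
        [0.06, 0.07, 0.08, 0.09, 0.55],
    ]
    for row in top_fraction_mask(toy):
        print(row)
```

Applying the same function to both P and P̂ would give two masks that could then be compared, for instance by their overlap; how the appendix actually compares them is not stated in this excerpt.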