FBI: Learning Dexterous In-hand Manipulation with Dynamic Visuotactile Shortcut Policy

Chen, Yijin; Xu, Wenqiang; Yu, Zhenjun; Tang, Tutian; Li, Yutong; Yao, Siqiong; Lu, Cewu

arXiv.org Artificial Intelligence 

Figure 1: We propose Flow Before Imitation (FBI), a novel dynamic visuotactile imitation learning algorithm for dexterous in-hand manipulation. FBI's design enables two operational modes in the real world, with or without physical tactile sensors, which substantially broadens its application scenarios.

Abstract: Dexterous in-hand manipulation is a long-standing challenge in robotics due to complex contact dynamics and partial observability. This paper introduces Flow Before Imitation (FBI), a visuotactile imitation learning framework that dynamically fuses tactile interactions with visual observations through motion dynamics. Unlike prior static fusion methods, FBI establishes a causal link between tactile signals and object motion via a dynamics-aware latent model. FBI employs a transformer-based interaction module to fuse flow-derived tactile features with visual inputs, and trains a one-step diffusion policy for real-time execution. Extensive experiments demonstrate that FBI outperforms the baseline methods in both simulation and the real world on two customized in-hand manipulation tasks and three standard dexterous manipulation tasks.
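To make the idea of attention-based fusion concrete, the following is a minimal sketch of single-head cross-attention in which visual tokens attend to flow-derived tactile tokens. It is an illustration under stated assumptions, not the FBI architecture: the abstract only says that the interaction module is transformer-based, so the function name `cross_attention_fuse`, the token counts, the feature width `d`, the random projection matrices, and the residual connection are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(visual_tokens, tactile_tokens, w_q, w_k, w_v):
    """Hypothetical fusion step: visual tokens query tactile tokens.

    visual_tokens:  (n_visual, d) array of visual features
    tactile_tokens: (n_tactile, d) array of flow-derived tactile features
    w_q, w_k, w_v:  (d, d) projection matrices (assumed, not from the paper)
    Returns the fused visual tokens and the attention weights.
    """
    q = visual_tokens @ w_q                    # (n_visual, d)
    k = tactile_tokens @ w_k                   # (n_tactile, d)
    v = tactile_tokens @ w_v                   # (n_tactile, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    attn = softmax(scores, axis=-1)            # each row sums to 1
    fused = visual_tokens + attn @ v           # residual connection
    return fused, attn


# Toy inputs standing in for encoder outputs (random, for illustration only).
rng = np.random.default_rng(0)
d = 16
visual_tokens = rng.normal(size=(8, d))
tactile_tokens = rng.normal(size=(5, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = cross_attention_fuse(visual_tokens, tactile_tokens, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In a full policy, fused tokens of this kind would typically be pooled and passed to the action head as conditioning. How FBI derives tactile features from flow and how it conditions its one-step diffusion policy are not specified in the abstract, so those parts are left out of the sketch.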