ViTacGen: Robotic Pushing with Vision-to-Touch Generation
Wu, Zhiyuan, Lin, Yijiong, Zhao, Yongqiang, Zhang, Xuyang, Chen, Zhuo, Lepora, Nathan, Luo, Shan
arXiv.org Artificial Intelligence
Abstract: Robotic pushing is a fundamental manipulation task that requires tactile feedback to capture subtle contact forces and dynamics between the end-effector and the object. However, real tactile sensors often face hardware limitations such as high cost and fragility, as well as deployment challenges involving calibration and variation between sensors, while vision-only policies struggle to achieve satisfactory performance. Inspired by humans' ability to infer tactile states from vision, we propose ViTacGen, a novel robot manipulation framework for visual robotic pushing that uses vision-to-touch generation in reinforcement learning to eliminate the reliance on high-resolution real tactile sensors, enabling effective zero-shot deployment on visual-only robotic systems. Specifically, ViTacGen consists of an encoder-decoder vision-to-touch generation network that generates contact depth images, a standardized tactile representation, directly from visual image sequences, followed by a reinforcement learning policy that fuses visual-tactile data with contrastive learning based on visual and generated tactile observations.

Robotic pushing is a fundamental manipulation task that involves applying forces to move objects toward a specified target region [1]. This task requires precise perception of the interactions between the robot and its environment during execution to enable accurate dynamic control [2]. In recent years, data-driven reinforcement learning (RL) approaches relying primarily on visual input have been widely explored for robotic pushing tasks.
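The contrastive fusion of visual and generated tactile observations mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which contrastive objective ViTacGen uses, so the InfoNCE-style loss below is an assumption chosen for illustration, as are the function name `info_nce_loss`, the temperature value, and the embedding sizes. It treats the visual and tactile embeddings from the same time step as a positive pair and all other pairings in the batch as negatives.

```python
import numpy as np

def info_nce_loss(visual_emb, tactile_emb, temperature=0.1):
    """InfoNCE-style loss between paired embeddings of shape (batch, dim).

    Assumption: this is a generic contrastive objective, not necessarily
    the one used in the paper. Row i of each array is a positive pair.
    """
    # L2-normalise so the dot product is a cosine similarity.
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    t = tactile_emb / np.linalg.norm(tactile_emb, axis=1, keepdims=True)

    # Similarity of every visual embedding with every tactile embedding.
    logits = v @ t.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives lie on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_probs)))

# Hypothetical embeddings: 8 time steps, 32-dimensional features.
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 32))
aligned_tactile = visual + 0.01 * rng.normal(size=(8, 32))
unrelated_tactile = rng.normal(size=(8, 32))

aligned_loss = info_nce_loss(visual, aligned_tactile)
unrelated_loss = info_nce_loss(visual, unrelated_tactile)
print(aligned_loss, unrelated_loss)
```

In this toy setup the loss is lower when the tactile embeddings agree with their visual counterparts than when they are unrelated, which is the property a policy encoder would exploit to align the two modalities.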
Oct-24-2025