Latent Representations for Visual Proprioception in Inexpensive Robots

Sahara Sheikholeslami, Ladislau Bölöni

arXiv.org Artificial Intelligence 

Abstract: Robotic manipulation requires explicit or implicit knowledge of the robot's joint positions. Precise proprioception is standard in high-quality industrial robots but is often unavailable in inexpensive robots operating in unstructured environments. In this paper, we ask: to what extent can a fast, single-pass regression architecture perform visual proprioception from a single external camera image, available even in the simplest manipulation settings? We explore several latent representations, including CNNs, VAEs, ViTs, and bags of uncalibrated fiducial markers, using fine-tuning techniques adapted to the limited data available. We evaluate the achievable accuracy through experiments on an inexpensive 6-DoF robot.

1 Introduction

Proprioception is the task of recovering the configuration of the robot from its own sensors, in contrast to perception, which is directed towards the external reality. In some settings, proprioception is an engineering problem solved by the internal sensors of the robot. For instance, high-quality industrial robots are so precisely actuated that we can safely consider their joint configurations known. However, in certain scenarios, such as inexpensive robots operating in unstructured environments, the proprioception information coming from the robot might be noisy, uncertain, or unreliable. Such robots might be controlled through policies based on end-to-end reinforcement learning or imitation learning that define actions as functions of an external observation, a = π(o), which appears to sidestep the proprioception problem. In practice, however, if some internal proprioception is available, it can be combined with the results of external perception, in the hope that explicit proprioceptive data can support task performance.
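As an illustration of the last point, a noisy internal joint reading and a vision-based joint estimate can be combined by standard inverse-variance weighting. The sketch below is not from the paper; the function name, the variance values, and the example joint angles are illustrative assumptions for a 6-DoF arm.

```python
import numpy as np

def fuse_estimates(q_internal, var_internal, q_visual, var_visual):
    """Inverse-variance weighted fusion of two independent joint-angle
    estimates (radians). Scalar variances are broadcast over all joints."""
    w_int = 1.0 / var_internal
    w_vis = 1.0 / var_visual
    q_fused = (w_int * q_internal + w_vis * q_visual) / (w_int + w_vis)
    var_fused = 1.0 / (w_int + w_vis)  # always below either input variance
    return q_fused, var_fused

# Hypothetical readings: cheap encoders (high variance) vs. visual
# proprioception (lower variance) for a 6-DoF arm.
q_int = np.array([0.10, -0.52, 1.31, 0.02, -0.78, 0.45])
q_vis = np.array([0.12, -0.50, 1.28, 0.05, -0.80, 0.44])
q, var = fuse_estimates(q_int, 0.04, q_vis, 0.01)
```

The fused estimate lies between the two inputs, weighted toward the lower-variance source, and its variance is smaller than either input's, which is the sense in which explicit proprioceptive data can support a vision-driven policy.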