CAVER: Curious Audiovisual Exploring Robot
Macesanu, Luca, Folefack, Boueny, Singh, Samik, Ray, Ruchira, Abbatematteo, Ben, Martín-Martín, Roberto
Abstract-- Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear). However, to unlock such multimodal potential, robots need to learn the correlations between an object's visual appearance and the sound it generates when they interact with it. Such an active sensorimotor experience requires new interaction capabilities, representations, and exploration methods to guide the robot in efficiently building increasingly rich audiovisual knowledge. In this work, we present CAVER, a novel robot that builds and utilizes rich audiovisual representations of objects. CAVER includes three novel contributions: 1) a novel 3D printed end-effector, attachable to parallel grippers, that excites objects' audio responses, 2) an audiovisual representation that combines local and global appearance information with sound features, and 3) an exploration algorithm that uses and builds the audiovisual representation in a curiosity-driven manner that prioritizes interacting with high uncertainty objects to obtain good coverage of surprising audio with fewer interactions. We demonstrate that CAVER builds rich representations in different scenarios more efficiently than several exploration baselines, and that the learned audiovisual representation leads to significant improvements in material classification and the imitation of audio-only human demonstrations.

Humans learn and exploit multimodal audiovisual cues in everyday life to obtain a more complete understanding of their environment and broader manipulation capabilities. We routinely fuse audio and vision to understand materials and reproduce behaviors: tapping a mug reveals glass vs. ceramic, and hearing a melody lets a musician find the right key.
Building similar capabilities in robots would increase their robustness and autonomy, but requires a representation that couples how things look with how they sound when interacted with, and a way to acquire that representation efficiently through interaction.