IMU2CLIP: Multimodal Contrastive Learning for IMU Motion Sensors from Egocentric Videos and Text
Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Alireza Dirafzoon, Aparajita Saraf, Amy Bearman, Babak Damavandi
arXiv.org Artificial Intelligence
ABSTRACT We present IMU2CLIP, a novel pre-training approach to align Inertial Measurement Unit (IMU) motion sensor recordings with video and text by projecting them into the joint representation space of Contrastive Language-Image Pre-training (CLIP). The proposed approach allows IMU2CLIP to translate human motions (as measured by IMU sensors) into their corresponding textual descriptions and videos, while preserving the transitivity across these modalities. We explore several new IMU-based applications that IMU2CLIP enables, such as motion-based media retrieval and natural language reasoning tasks with motion data. In addition, we show that IMU2CLIP can significantly improve the downstream performance when fine-tuned for each application. Our code will be made publicly available.
[Figure 1: Illustration of IMU2CLIP (I2C): (a) the model aligns …; (b) once trained, IMU2CLIP is used as a retriever for both …]
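The abstract describes aligning an IMU encoder's outputs with CLIP's joint video/text embedding space via contrastive learning. A minimal NumPy sketch of the kind of symmetric InfoNCE objective such an alignment typically uses is shown below; the function name, shapes, and temperature value are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def symmetric_info_nce(imu_emb, clip_emb, temperature=0.07):
    """Contrastive loss pulling matched IMU/CLIP embedding pairs together.

    imu_emb, clip_emb: arrays of shape (batch, dim); row i of each is a
    matched pair (same clip of egocentric video / motion recording).
    """
    # L2-normalize so the dot product is cosine similarity
    imu = imu_emb / np.linalg.norm(imu_emb, axis=1, keepdims=True)
    clip = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by temperature; matched pairs
    # sit on the diagonal
    logits = imu @ clip.T / temperature
    labels = np.arange(len(imu))

    def cross_entropy(l):
        # Numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the IMU->CLIP and CLIP->IMU directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimizing this loss drives each IMU embedding toward its paired CLIP video/text embedding and away from the other items in the batch, which is what makes the resulting encoder usable as a cross-modal retriever.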
Oct-25-2022