MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features

Bardes, Adrien, Ponce, Jean, LeCun, Yann

arXiv.org Artificial Intelligence 

Self-supervised learning of visual representations has focused on learning content features, which identify and differentiate objects in images and videos but do not capture object motion or location. Optical flow estimation, on the other hand, is a task that does not involve understanding the content of the images on which it is estimated. We unify the two approaches and introduce MC-JEPA, a joint-embedding predictive architecture and self-supervised learning approach that jointly learns optical flow and content features within a shared encoder. We demonstrate that the two associated objectives, the optical flow estimation objective and the self-supervised learning objective, benefit from each other and thus learn content features that incorporate motion information. The proposed approach achieves performance on par with existing methods on unsupervised optical flow benchmarks, as well as with common self-supervised learning approaches on downstream tasks such as semantic segmentation of images and videos.

Most self-supervised methods focus on learning global features that achieve strong results in tasks such as object classification or action recognition in videos. A more recent trend aims at learning localized features that perform well on local tasks such as detection and segmentation (Xiao et al., 2021; Wang et al., 2021; Hénaff et al., 2021; 2022; Bardes et al., 2022b). However, these methods focus on understanding the content of images and videos and are not able to learn information at the pixel level, such as motion in videos or details in textures. In this paper, we focus on jointly learning motion features, by using self-supervised optical flow estimation (Horn & Schunck, 1981) from videos as a pretext task, and content features, with general self-supervised learning. Optical flow captures the motion, or dense pixel correspondence, that occurs between two images, for instance consecutive frames in a video or images from a stereo pair. Estimating it is a fundamental problem in computer vision, whose solution is key to tasks such as visual odometry, depth estimation, or object tracking.
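
To make the notion of dense pixel correspondence concrete, the following is a minimal pure-Python sketch on a toy pair of frames. It is not taken from the paper: the function names `warp_backward` and `photometric_error`, the nearest-pixel sampling, and the border clamping are all assumptions chosen for illustration. A flow field assigns a displacement (dx, dy) to every pixel of the first frame; a flow that matches the true motion lets the first frame be reconstructed by sampling the second frame at the displaced positions, which is the kind of reconstruction signal that label-free flow estimation can rely on.

```python
def warp_backward(frame2, flow):
    """Reconstruct frame 1 by sampling frame 2 at (x + dx, y + dy).

    Illustrative assumption: nearest-pixel sampling, clamped at the borders.
    """
    h, w = len(frame2), len(frame2[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(x + round(dx), 0), w - 1)
            sy = min(max(y + round(dy), 0), h - 1)
            out[y][x] = frame2[sy][sx]
    return out


def photometric_error(a, b):
    """Mean absolute intensity difference between two equally sized images."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


# Toy frames (hypothetical data): one bright pixel moves one step to the right.
frame1 = [[0, 0, 0, 0],
          [0, 9, 0, 0],
          [0, 0, 0, 0]]
frame2 = [[0, 0, 0, 0],
          [0, 0, 9, 0],
          [0, 0, 0, 0]]

true_flow = [[(1, 0)] * 4 for _ in range(3)]  # every pixel moves by dx=1, dy=0
zero_flow = [[(0, 0)] * 4 for _ in range(3)]  # "no motion" hypothesis

error_true = photometric_error(warp_backward(frame2, true_flow), frame1)
error_zero = photometric_error(warp_backward(frame2, zero_flow), frame1)
print(error_true, error_zero)
```

In this toy case the correct flow reconstructs the first frame exactly, while the zero-flow hypothesis leaves a nonzero error. Real estimators work with sub-pixel displacements and interpolated sampling, which this sketch deliberately omits.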
