PooDLe: Pooled and dense self-supervised learning from naturalistic videos
Wang, Alex N.; Hoang, Christopher; Xiong, Yuwen; LeCun, Yann; Ren, Mengye
arXiv.org Artificial Intelligence
Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our findings indicate that a unified objective applied at multiple feature scales is essential for learning effective image representations from high-resolution, naturalistic videos. We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective.
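To make the two objectives named in the abstract concrete, here is a minimal NumPy sketch of how a pooled invariance term and a dense flow-equivariance term might be combined for a pair of video frames. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the loss forms, so cosine distance is a stand-in, and the names `pooled_loss`, `dense_loss`, `combined_loss`, and `dense_weight`, the nearest-neighbour warping, and the flow convention (per-cell `(dy, dx)` offsets from frame 1 to frame 2) are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def warp_nearest(feat, flow):
    """Sample feat (H, W, C) at locations displaced by flow (H, W, 2).

    Assumption: flow holds (dy, dx) offsets in feature-grid cells and
    nearest-neighbour sampling is good enough for illustration.
    Returns the warped map and a mask of in-bounds source locations.
    """
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.rint(ys + flow[..., 0]).astype(int)
    src_x = np.rint(xs + flow[..., 1]).astype(int)
    valid = (src_y >= 0) & (src_y < H) & (src_x >= 0) & (src_x < W)
    warped = feat[np.clip(src_y, 0, H - 1), np.clip(src_x, 0, W - 1)]
    return warped, valid


def pooled_loss(f1, f2):
    """Invariance term: global-average-pooled features should agree."""
    z1 = l2_normalize(f1.mean(axis=(0, 1)))
    z2 = l2_normalize(f2.mean(axis=(0, 1)))
    return 1.0 - float(z1 @ z2)


def dense_loss(f1, f2, flow):
    """Equivariance term: frame-2 features, warped back along the flow,
    should match frame-1 features at every in-bounds location."""
    warped, valid = warp_nearest(f2, flow)
    if not valid.any():
        return 0.0
    cos = np.sum(l2_normalize(f1) * l2_normalize(warped), axis=-1)
    return float(np.mean(1.0 - cos[valid]))


def combined_loss(f1, f2, flow, dense_weight=1.0):
    """Hypothetical weighted sum of the pooled and dense terms."""
    return pooled_loss(f1, f2) + dense_weight * dense_loss(f1, f2, flow)
```

In this sketch the dense term vanishes when the second feature map is exactly the first one displaced by the flow, while the pooled term ignores spatial layout entirely. That is one way to read the abstract's split between spatial understanding from the dense objective and semantic understanding from the pooled objective.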
Aug-20-2024
- Country:
  - Europe (0.04)
  - Asia (0.04)
  - Pacific Ocean > North Pacific Ocean > San Francisco Bay (0.04)
  - North America > United States > New York (0.04)
  - North America > United States > California > San Francisco County > San Francisco (0.04)
- Genre:
  - Research Report > New Finding (0.34)
- Industry:
  - Transportation > Ground > Road (0.46)
- Technology: