PooDLe: Pooled and dense self-supervised learning from naturalistic videos

Wang, Alex N., Hoang, Christopher, Xiong, Yuwen, LeCun, Yann, Ren, Mengye

Aug-20-2024–arXiv.org Artificial Intelligence

Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our findings indicate that a unified objective applied at multiple feature scales is essential for learning effective image representations from high-resolution, naturalistic videos. We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective.
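
As a rough illustration of how the two objectives described above could be combined, the following is a minimal NumPy sketch, not the authors' implementation. Several things here are assumptions made only for illustration: the function names (`warp_nearest`, `dense_equivariance_loss`, `pooled_invariance_loss`, `combined_loss`), the use of cosine distance as a stand-in for the actual SSL objective, nearest-neighbor warping, global average pooling, and the `dense_weight` parameter. The sketch also omits everything a real training setup would need to avoid representation collapse.

```python
import numpy as np


def l2_normalize(x, axis, eps=1e-8):
    """Scale x to unit L2 norm along `axis` (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def warp_nearest(feat, flow):
    """Backward-warp a feature map with optical flow (nearest-neighbor lookup).

    feat: (C, H, W) feature map of the second frame.
    flow: (2, H, W) displacement in feature-map pixels; flow[0] is the
          x (column) offset and flow[1] is the y (row) offset (assumed layout).
    Output location (y, x) samples feat at (y + flow_y, x + flow_x).
    Returns the warped map and a boolean (H, W) mask of in-bounds samples.
    """
    _, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.rint(xs + flow[0]).astype(int)
    src_y = np.rint(ys + flow[1]).astype(int)
    valid = (src_x >= 0) & (src_x < W) & (src_y >= 0) & (src_y < H)
    src_x = np.clip(src_x, 0, W - 1)
    src_y = np.clip(src_y, 0, H - 1)
    return feat[:, src_y, src_x], valid


def dense_equivariance_loss(feat_a, feat_b, flow):
    """Per-location cosine distance between feat_a and flow-warped feat_b."""
    warped, valid = warp_nearest(feat_b, flow)
    if not valid.any():
        return 0.0
    cos = (l2_normalize(feat_a, axis=0) * l2_normalize(warped, axis=0)).sum(axis=0)
    return float((1.0 - cos)[valid].mean())


def pooled_invariance_loss(feat_a, feat_b):
    """Cosine distance between globally average-pooled feature vectors."""
    a = l2_normalize(feat_a.mean(axis=(1, 2)), axis=0)
    b = l2_normalize(feat_b.mean(axis=(1, 2)), axis=0)
    return float(1.0 - a @ b)


def combined_loss(feat_a, feat_b, flow, dense_weight=1.0):
    """Sum of the pooled term and the weighted dense term (hypothetical weighting)."""
    return pooled_invariance_loss(feat_a, feat_b) + dense_weight * dense_equivariance_loss(
        feat_a, feat_b, flow
    )
```

In this sketch, the dense term is small only when features move consistently with the flow field, while the pooled term ignores spatial layout and compares a single summary vector per frame. A faithful version would apply such terms at several feature scales, as the abstract describes, and would use differentiable (for example bilinear) warping inside a deep learning framework.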

objective, representation, subcrop

Country:
- Europe (0.04)
- Asia (0.04)
- Pacific Ocean > North Pacific Ocean
  - San Francisco Bay (0.04)
- North America > United States
  - New York (0.04)
  - California > San Francisco County
    - San Francisco (0.04)

Genre:
- Research Report > New Finding (0.34)

Industry:
- Transportation > Ground > Road (0.46)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Machine Learning > Inductive Learning (0.85)