Language-based Action Concept Spaces Improve Video Self-Supervised Learning

Jan-20-2025, 01:53:30 GMT–Neural Information Processing Systems

Recent contrastive language-image pre-training has produced highly transferable and robust image representations. However, adapting these models to the video domain with minimal supervision remains an open problem. We explore a simple step in that direction, using language-tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling is trained in a self-distillation setting, with training objectives that operate in an action concept space. This space is constructed from feature vectors of various action concepts, extracted from a language encoder using relevant textual prompts.
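
The abstract's idea of an objective that operates in an action concept space can be illustrated with a small NumPy sketch. It is not the paper's implementation: the random `concept_basis` stands in for text-encoder embeddings of action prompts, the teacher and student features stand in for outputs of a video backbone, and the function names, temperatures, and cross-entropy form of the loss are all assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length so dot products act as cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def project_to_concept_space(features, concept_basis, temperature):
    # Cosine similarity of each feature to each concept vector,
    # turned into a distribution over concepts.
    sims = l2_normalize(features) @ l2_normalize(concept_basis).T
    return softmax(sims / temperature)

def distillation_loss(student_probs, teacher_probs, eps=1e-8):
    # Cross-entropy between teacher and student concept distributions
    # (an assumed form of a self-distillation objective).
    log_student = np.log(np.clip(student_probs, eps, 1.0))
    return float(-(teacher_probs * log_student).sum(axis=-1).mean())

rng = np.random.default_rng(0)
num_concepts, dim, batch = 5, 16, 4

# Hypothetical stand-in for language-encoder features of prompts
# describing action concepts; a real setup would use a text encoder.
concept_basis = rng.normal(size=(num_concepts, dim))

# Hypothetical teacher and student video features for two views of the same clips.
teacher_feats = rng.normal(size=(batch, dim))
student_feats = teacher_feats + 0.1 * rng.normal(size=(batch, dim))

teacher_probs = project_to_concept_space(teacher_feats, concept_basis, temperature=0.05)
student_probs = project_to_concept_space(student_feats, concept_basis, temperature=0.1)
loss = distillation_loss(student_probs, teacher_probs)
```

In this sketch the teacher distribution is sharpened with a lower temperature than the student's, a common choice in self-distillation that is assumed here rather than taken from the abstract.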

representation, train objective, video self-supervised learning

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.66)