VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models

Jun-11-2026, 20:41:18 GMT – Neural Information Processing Systems

Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos.

artificial intelligence, machine learning, proceedings

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.56)