Ishan Misra
An Experimental Comparison of Multi-view Self-supervised Methods for Music Tagging
Meseguer-Brocal, Gabriel, Desblancs, Dorian, Hennequin, Romain
Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data. It is particularly compelling in the music domain, where labeled data is time-consuming to obtain, error-prone, and often ambiguous. During self-supervised pre-training, models are trained on pretext tasks, with the primary objective of acquiring robust and informative features that can later be fine-tuned for specific downstream tasks. The choice of pretext task is critical, as it guides the model to shape the feature space with meaningful constraints for information encoding. In the context of music, most works have relied on contrastive learning or masking techniques. In this study, we expand the scope of pretext tasks applied to music by investigating and comparing the performance of new self-supervised methods for music tagging. We open-source a simple ResNet model trained on a diverse catalog of millions of tracks. Our results demonstrate that, although most of these pre-training methods yield similar downstream results, contrastive learning consistently achieves better downstream performance than the other self-supervised methods. This holds true even in a limited-data downstream context.
- Media > Music (0.88)
- Leisure & Entertainment (0.88)
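The contrastive pretext task described in the abstract above can be made concrete with a small sketch. Below is a minimal InfoNCE-style loss in PyTorch, assuming two augmented "views" of each track embedded by a shared encoder; the function name, temperature value, and shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of an InfoNCE-style contrastive objective (illustrative,
# not the paper's released code).
import torch
import torch.nn.functional as F

def info_nce_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss over two embedding views of the same batch of tracks.

    z_a, z_b: (batch, dim) embeddings of two augmented views; row i of z_a and
    row i of z_b come from the same track (the positive pair), while every
    other row in the batch serves as a negative.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Diagonal entries are the positives; symmetrize over both view directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Each row of the similarity matrix is treated as a classification problem whose correct class is the matching view, which pulls positive pairs together and pushes the rest of the batch apart in the feature space.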
OmniMAE: Single Model Masked Pretraining on Images and Videos
Girdhar, Rohit, El-Nouby, Alaaeldin, Singh, Mannat, Alwala, Kalyan Vasudev, Joulin, Armand, Misra, Ishan
Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work studies these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures. In particular, we show that our single ViT-Huge model can be finetuned to achieve 86.6% on ImageNet and 75.5% on the challenging Something Something-v2 video benchmark, setting a new state-of-the-art.
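The high mask ratios quoted in the abstract (90% of image patches, 95% of video patches) are what make pretraining cheap: the encoder only ever processes the small visible subset of tokens. Below is a hedged sketch of such random patch masking in PyTorch; the helper name and shapes are assumptions for illustration, not OmniMAE's released code.

```python
# Hedged sketch of random patch masking for masked autoencoding
# (illustrative, not OmniMAE's implementation).
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.9):
    """patches: (batch, num_patches, dim). Returns the visible patches and the
    indices needed to restore the original ordering for the decoder."""
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=patches.device)   # random score per patch
    shuffle = noise.argsort(dim=1)                    # random permutation of patches
    restore = shuffle.argsort(dim=1)                  # inverse permutation
    keep_idx = shuffle[:, :num_keep]                  # indices of visible patches
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, restore
```

For a 224x224 image split into 196 patches, a 0.9 ratio leaves only 19 tokens for the encoder, which is how huge architectures stay fast to train despite the reconstruction objective covering the full input.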
#206 - Ishan Misra: Self-Supervised Deep Learning in Computer Vision
Ishan Misra is a research scientist at FAIR working on self-supervised visual learning.