Jenatton, Rodolphe
Three Towers: Flexible Contrastive Learning with Pretrained Image Models
Kossen, Jannik, Collier, Mark, Mustafa, Basil, Wang, Xiao, Zhai, Xiaohua, Beyer, Lucas, Steiner, Andreas, Berent, Jesse, Jenatton, Rodolphe, Kokiopoulou, Efi
We introduce Three Towers (3T), a flexible method to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers. While contrastive models are usually trained from scratch, LiT (Zhai et al., 2022) has recently shown performance gains from using pretrained classifier embeddings. However, LiT directly replaces the image tower with the frozen embeddings, excluding any potential benefits from training the image tower contrastively. With 3T, we propose a more flexible strategy that allows the image tower to benefit from both pretrained embeddings and contrastive training. To achieve this, we introduce a third tower that contains the frozen pretrained embeddings, and we encourage alignment between this third tower and the main image-text towers. Empirically, 3T consistently improves over LiT and the CLIP-style from-scratch baseline for retrieval tasks. For classification, 3T reliably improves over the from-scratch baseline, and while it underperforms relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k and Places365 pretraining.
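The alignment between the frozen third tower and the two main towers can be illustrated with a small sketch. It assumes a symmetric InfoNCE-style contrastive loss and an unweighted average of the three pairwise terms. The function names, the temperature value and the averaging are illustrative assumptions, not details taken from the abstract.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def contrastive_loss(xs, ys, temperature=0.1):
    """Symmetric InfoNCE-style loss between two batches of paired embeddings."""
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    n = len(xs)
    # sims[i][j] is the scaled cosine similarity between xs[i] and ys[j].
    sims = [[dot(x, y) / temperature for y in ys] for x in xs]

    def row_loss(matrix):
        # Cross-entropy with the matching pair (the diagonal) as the target.
        total = 0.0
        for i, row in enumerate(matrix):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / n

    transposed = [list(col) for col in zip(*sims)]
    return 0.5 * (row_loss(sims) + row_loss(transposed))


def three_tower_loss(image_emb, text_emb, frozen_emb):
    # Assumed combination: the main image-text term plus two alignment terms
    # against the frozen third tower, averaged with equal weight.
    return (
        contrastive_loss(image_emb, text_emb)
        + contrastive_loss(image_emb, frozen_emb)
        + contrastive_loss(text_emb, frozen_emb)
    ) / 3.0
```

In this sketch the frozen embeddings only enter through the loss, so the image tower remains trainable, which is the contrast with LiT that the abstract draws.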
Pi-DUAL: Using Privileged Information to Distinguish Clean from Noisy Labels
Wang, Ke, Ortiz-Jimenez, Guillermo, Jenatton, Rodolphe, Collier, Mark, Kokiopoulou, Efi, Frossard, Pascal
Label noise is a pervasive problem in deep learning that often compromises the generalization performance of trained models. Recently, leveraging privileged information (PI) -- information available only during training but not at test time -- has emerged as an effective approach to mitigate this issue. Yet, existing PI-based methods have failed to consistently outperform their no-PI counterparts in terms of preventing overfitting to label noise. To address this deficiency, we introduce Pi-DUAL, an architecture designed to harness PI to distinguish clean from wrong labels. Pi-DUAL decomposes the output logits into a prediction term, based on conventional input features, and a noise-fitting term influenced solely by PI. A gating mechanism steered by PI adaptively shifts focus between these terms, allowing the model to implicitly separate the learning paths of clean and wrong labels. Empirically, Pi-DUAL achieves significant performance improvements on key PI benchmarks (e.g., +6.8% on ImageNet-PI), establishing a new state-of-the-art test set accuracy. Additionally, Pi-DUAL is a potent method for identifying noisy samples post-training, outperforming other strong methods at this task. Overall, Pi-DUAL is a simple, scalable and practical approach for mitigating the effects of label noise in a variety of real-world scenarios with PI.
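A minimal sketch of the logit decomposition described above, assuming the gate is a scalar in (0, 1) that forms a convex combination of the two terms. The exact combination rule and all names below are assumptions for illustration. The abstract states only that a PI-driven gate shifts focus between a prediction term and a noise-fitting term.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pi_dual_logits(prediction_logits, noise_logits, gate_score):
    """Blend the two terms with a scalar gate in (0, 1).

    prediction_logits: per-class outputs of a network that sees only the input x.
    noise_logits: per-class outputs of a network that sees only the PI.
    gate_score: scalar output of a gating network that sees only the PI.
    """
    gate = sigmoid(gate_score)
    return [
        (1.0 - gate) * p + gate * e
        for p, e in zip(prediction_logits, noise_logits)
    ]


def predict_at_test_time(prediction_logits):
    # PI is unavailable at test time, so only the prediction term is used.
    return max(range(len(prediction_logits)), key=prediction_logits.__getitem__)
```

Under this assumed form, a gate close to 1 lets the noise-fitting term absorb a wrong label, while a gate close to 0 routes the training signal to the prediction term.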
When does Privileged Information Explain Away Label Noise?
Ortiz-Jimenez, Guillermo, Collier, Mark, Nawalgaria, Anant, D'Amour, Alexander, Berent, Jesse, Jenatton, Rodolphe, Kokiopoulou, Effrosyni
Leveraging privileged information (PI), or features available during training but not at test time, has recently been shown to be an effective method for addressing label noise. However, the reasons for its effectiveness are not well understood. In this study, we investigate the role played by different properties of the PI in explaining away label noise. Through experiments on multiple datasets with real PI (CIFAR-N/H) and a new large-scale benchmark ImageNet-PI, we find that PI is most helpful when it allows networks to easily distinguish clean from noisy data, while enabling a learning shortcut to memorize the noisy examples. Interestingly, when PI becomes too predictive of the target label, PI methods often perform worse than their no-PI baselines. Based on these findings, we propose several enhancements to the state-of-the-art PI methods and demonstrate the potential of PI as a means of tackling label noise. Finally, we show how we can easily combine the resulting PI approaches with existing no-PI techniques designed to deal with label noise.
Scaling Vision Transformers to 22 Billion Parameters
Dehghani, Mostafa, Djolonga, Josip, Mustafa, Basil, Padlewski, Piotr, Heek, Jonathan, Gilmer, Justin, Steiner, Andreas, Caron, Mathilde, Geirhos, Robert, Alabdulmohsin, Ibrahim, Jenatton, Rodolphe, Beyer, Lucas, Tschannen, Michael, Arnab, Anurag, Wang, Xiao, Riquelme, Carlos, Minderer, Matthias, Puigcerver, Joan, Evci, Utku, Kumar, Manoj, van Steenkiste, Sjoerd, Elsayed, Gamaleldin F., Mahendran, Aravindh, Yu, Fisher, Oliver, Avital, Huot, Fantine, Bastings, Jasmijn, Collier, Mark Patrick, Gritsenko, Alexey, Birodkar, Vighnesh, Vasconcelos, Cristina, Tay, Yi, Mensink, Thomas, Kolesnikov, Alexander, Pavetić, Filip, Tran, Dustin, Kipf, Thomas, Lučić, Mario, Zhai, Xiaohua, Keysers, Daniel, Harmsen, Jeremiah, Houlsby, Neil
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
Massively Scaling Heteroscedastic Classifiers
Collier, Mark, Jenatton, Rodolphe, Mustafa, Basil, Houlsby, Neil, Berent, Jesse, Kokiopoulou, Effrosyni
Heteroscedastic classifiers, which learn a multivariate Gaussian distribution over prediction logits, have been shown to perform well on image classification problems with hundreds to thousands of classes. However, compared to standard classifiers, they introduce extra parameters that scale linearly with the number of classes. This makes them infeasible to apply to larger-scale problems. In addition, heteroscedastic classifiers introduce a critical temperature hyperparameter which must be tuned. We propose HET-XL, a heteroscedastic classifier whose parameter count when compared to a standard classifier scales independently of the number of classes. In our large-scale settings, we show that we can remove the need to tune the temperature hyperparameter by directly learning it on the training data. On large image classification datasets with up to 4B images and 30k classes, our method requires 14X fewer additional parameters, does not require tuning the temperature on a held-out set, and performs consistently better than the baseline heteroscedastic classifier. HET-XL improves ImageNet 0-shot classification in a multimodal contrastive learning setup which can be viewed as a 3.5 billion class classification problem.
Sparse MoEs meet Efficient Ensembles
Allingham, James Urquhart, Wenzel, Florian, Mariet, Zelda E, Mustafa, Basil, Puigcerver, Joan, Houlsby, Neil, Jerfel, Ghassen, Fortuin, Vincent, Lakshminarayanan, Balaji, Snoek, Jasper, Tran, Dustin, Ruiz, Carlos Riquelme, Jenatton, Rodolphe
Machine learning models based on the aggregated outputs of submodels, either at the activation or prediction levels, lead to strong performance. We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs). First, we show that these two approaches have complementary features whose combination is beneficial. Then, we present partitioned batch ensembles, an efficient ensemble of sparse MoEs that takes the best of both classes of models. Extensive experiments on fine-tuned vision transformers demonstrate the accuracy, log-likelihood, few-shot learning, robustness, and uncertainty calibration improvements of our approach over several challenging baselines. Partitioned batch ensembles not only scale to models with up to 2.7B parameters, but also provide larger performance gains for larger models.
Deep Classifiers with Label Noise Modeling and Distance Awareness
Fortuin, Vincent, Collier, Mark, Wenzel, Florian, Allingham, James, Liu, Jeremiah, Tran, Dustin, Lakshminarayanan, Balaji, Berent, Jesse, Jenatton, Rodolphe, Kokiopoulou, Effrosyni
Uncertainty estimation in deep learning has recently emerged as a crucial area of interest to advance reliability and robustness in safety-critical applications. While there have been many proposed methods that either focus on distance-aware model uncertainties for out-of-distribution detection or on input-dependent label uncertainties for in-distribution calibration, both of these types of uncertainty are often necessary. In this work, we propose the HetSNGP method for jointly modeling the model and data uncertainty. We show that our proposed model affords a favorable combination between these two complementary types of uncertainty and thus outperforms the baseline methods on some challenging out-of-distribution datasets, including CIFAR-100C, Imagenet-C, and Imagenet-A. Moreover, we propose HetSNGP Ensemble, an ensembled version of our method which adds an additional type of uncertainty and also outperforms other ensemble baselines.
Scaling Vision with Sparse Mixture of Experts
Riquelme, Carlos, Puigcerver, Joan, Mustafa, Basil, Neumann, Maxim, Jenatton, Rodolphe, Pinto, André Susano, Keysers, Daniel, Houlsby, Neil
Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade off performance and compute smoothly at test time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.
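The difference between a dense layer and a sparsely-gated one can be sketched as follows. The example assumes softmax gating with top-k expert selection, a common choice for sparse MoEs. It is not claimed to be the exact V-MoE router, and it does not cover the batch-level prioritization extension. All function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k=2):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    gates = softmax(router_logits)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    return [(i, gates[i]) for i in top]


def sparse_moe_layer(token, router_logits, experts, k=2):
    # Only the k selected experts are evaluated, so per-token compute
    # depends on k rather than on the total number of experts.
    output = [0.0] * len(token)
    for index, weight in route_token(router_logits, k):
        expert_out = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output
```

Here `experts` is a list of callables standing in for expert MLPs. In a dense layer every one of them would run on every token.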
Amazon SageMaker Automatic Model Tuning: Scalable Black-box Optimization
Perrone, Valerio, Shen, Huibin, Zolic, Aida, Shcherbatyi, Iaroslav, Ahmed, Amr, Bansal, Tanya, Donini, Michele, Winkelmolen, Fela, Jenatton, Rodolphe, Faddoul, Jean Baptiste, Pogorzelska, Barbara, Miladinovic, Miroslav, Kenthapadi, Krishnaram, Seeger, Matthias, Archambeau, Cédric
Tuning complex machine learning systems is challenging. Machine learning models typically expose a set of hyperparameters, be it regularization, architecture, or optimization parameters, whose careful tuning is critical to achieve good performance. To democratize access to such systems, it is essential to automate this tuning process. This paper presents Amazon SageMaker Automatic Model Tuning (AMT), a fully managed system for black-box optimization at scale. AMT finds the best version of a machine learning model by repeatedly training it with different hyperparameter configurations. It leverages either random search or Bayesian optimization to choose the hyperparameter values resulting in the best-performing model, as measured by the metric chosen by the user. AMT can be used with built-in algorithms, custom algorithms, and Amazon SageMaker pre-built containers for machine learning frameworks. We discuss the core functionality, system architecture and our design principles. We also describe some more advanced features provided by AMT, such as automated early stopping and warm-starting, demonstrating their benefits in experiments.
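The tuning loop described above, repeatedly training with different hyperparameter configurations and keeping the best one according to a user-chosen metric, can be sketched with plain random search. This is a generic illustration and does not use or reproduce the SageMaker API. The function names, the uniform sampling and the toy objective are assumptions.

```python
import random


def random_search(train_and_evaluate, search_space, num_trials=20, seed=0):
    """Sample configurations uniformly and keep the one with the highest metric."""
    rng = random.Random(seed)
    best_config, best_metric = None, float("-inf")
    for _ in range(num_trials):
        config = {
            name: rng.uniform(low, high)
            for name, (low, high) in search_space.items()
        }
        metric = train_and_evaluate(config)
        if metric > best_metric:
            best_config, best_metric = config, metric
    return best_config, best_metric


def toy_objective(config):
    # Hypothetical stand-in for a training job; peaks at lr=0.1, weight_decay=0.01.
    return -((config["lr"] - 0.1) ** 2) - (config["weight_decay"] - 0.01) ** 2
```

A call such as `random_search(toy_objective, {"lr": (0.0, 1.0), "weight_decay": (0.0, 0.1)})` returns the best sampled configuration and its metric. Bayesian optimization would replace the uniform sampling step with a model-guided proposal.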
Hyperparameter Ensembles for Robustness and Uncertainty Quantification
Wenzel, Florian, Snoek, Jasper, Tran, Dustin, Jenatton, Rodolphe
Ensembles over neural network weights trained from different random initializations, known as deep ensembles, achieve state-of-the-art accuracy and calibration. The recently introduced batch ensembles provide a drop-in replacement that is more parameter efficient. In this paper, we design ensembles not only over weights, but over hyperparameters to improve the state of the art in both settings. For best performance independent of budget, we propose hyper-deep ensembles, a simple procedure that involves a random search over different hyperparameters, themselves stratified across multiple random initializations. Its strong performance highlights the benefit of combining models with both weight and hyperparameter diversity. We further propose a parameter-efficient version, hyper-batch ensembles, which builds on the layer structure of batch ensembles and self-tuning networks. The computational and memory costs of our method are notably lower than typical ensembles. On image classification tasks, with MLP, LeNet, ResNet 20 and Wide ResNet 28-10 architectures, we improve upon both deep and batch ensembles.