Alabdulmohsin, Ibrahim
FlexiViT: One Model for All Patch Sizes
Beyer, Lucas, Izmailov, Pavel, Kolesnikov, Alexander, Caron, Mathilde, Kornblith, Simon, Zhai, Xiaohua, Minderer, Matthias, Tschannen, Michael, Alabdulmohsin, Ibrahim, Pavetic, Filip
Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision
Scaling Vision Transformers to 22 Billion Parameters
Dehghani, Mostafa, Djolonga, Josip, Mustafa, Basil, Padlewski, Piotr, Heek, Jonathan, Gilmer, Justin, Steiner, Andreas, Caron, Mathilde, Geirhos, Robert, Alabdulmohsin, Ibrahim, Jenatton, Rodolphe, Beyer, Lucas, Tschannen, Michael, Arnab, Anurag, Wang, Xiao, Riquelme, Carlos, Minderer, Matthias, Puigcerver, Joan, Evci, Utku, Kumar, Manoj, van Steenkiste, Sjoerd, Elsayed, Gamaleldin F., Mahendran, Aravindh, Yu, Fisher, Oliver, Avital, Huot, Fantine, Bastings, Jasmijn, Collier, Mark Patrick, Gritsenko, Alexey, Birodkar, Vighnesh, Vasconcelos, Cristina, Tay, Yi, Mensink, Thomas, Kolesnikov, Alexander, Pavetiฤ, Filip, Tran, Dustin, Kipf, Thomas, Luฤiฤ, Mario, Zhai, Xiaohua, Keysers, Daniel, Harmsen, Jeremiah, Houlsby, Neil
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
Adapting to Latent Subgroup Shifts via Concepts and Proxies
Alabdulmohsin, Ibrahim, Chiou, Nicole, D'Amour, Alexander, Gretton, Arthur, Koyejo, Sanmi, Kusner, Matt J., Pfohl, Stephen R., Salaudeen, Olawale, Schrouff, Jessica, Tsai, Katherine
We address the problem of unsupervised domain adaptation when the source domain differs from the target domain because of a shift in the distribution of a latent subgroup. When this subgroup confounds all observed data, neither covariate shift nor label shift assumptions apply. We show that the optimal target predictor can be non-parametrically identified with the help of concept and proxy variables available only in the source domain, and unlabeled data from the target. The identification results are constructive, immediately suggesting an algorithm for estimating the optimal predictor in the target. For continuous observations, when this algorithm becomes impractical, we propose a latent variable model specific to the data generation process at hand. We show how the approach degrades as the size of the shift changes, and verify that it outperforms both covariate and label shift adjustment.
Layer-Stack Temperature Scaling
Khalifa, Amr, Mozer, Michael C., Sedghi, Hanie, Neyshabur, Behnam, Alabdulmohsin, Ibrahim
Recent works demonstrate that early layers in a neural network contain useful information for prediction. Inspired by this, we show that extending temperature scaling across all layers improves both calibration and accuracy. We call this procedure "layer-stack temperature scaling" (LATES). Informally, LATES grants each layer a weighted vote during inference. We evaluate it on five popular convolutional neural network architectures both in- and out-of-distribution and observe a consistent improvement over temperature scaling in terms of accuracy, calibration, and AUC. All conclusions are supported by comprehensive statistical analyses. Since LATES neither retrains the architecture nor introduces many more parameters, its advantages can be reaped without requiring additional data beyond what is used in temperature scaling. Finally, we show that combining LATES with Monte Carlo Dropout matches state-of-the-art results on CIFAR10/100.
Maintaining fairness across distribution shift: do we have viable solutions for real-world applications?
Schrouff, Jessica, Harris, Natalie, Koyejo, Oluwasanmi, Alabdulmohsin, Ibrahim, Schnider, Eva, Opsahl-Ong, Krista, Brown, Alex, Roy, Subhrajit, Mincu, Diana, Chen, Christina, Dieng, Awa, Liu, Yuan, Natarajan, Vivek, Karthikesalingam, Alan, Heller, Katherine, Chiappa, Silvia, D'Amour, Alexander
Fairness and robustness are often considered as orthogonal dimensions when evaluating machine learning models. However, recent work has revealed interactions between fairness and robustness, showing that fairness properties are not necessarily maintained under distribution shift. In healthcare settings, this can result in e.g. a model that performs fairly according to a selected metric in "hospital A" showing unfairness when deployed in "hospital B". While a nascent field has emerged to develop provable fair and robust models, it typically relies on strong assumptions about the shift, limiting its impact for real-world applications. In this work, we explore the settings in which recently proposed mitigation strategies are applicable by referring to a causal framing. Using examples of predictive models in dermatology and electronic health records, we show that real-world applications are complex and often invalidate the assumptions of such methods. Our work hence highlights technical, practical, and engineering gaps that prevent the development of robustly fair machine learning models for real-world applications. Finally, we discuss potential remedies at each step of the machine learning pipeline.
A Near-Optimal Algorithm for Debiasing Trained Machine Learning Models
Alabdulmohsin, Ibrahim, Lucic, Mario
We present a scalable post-processing algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. We empirically validate its advantages on standard benchmark datasets across both classical algorithms as well as modern DNN architectures and demonstrate that it outperforms previous post-processing methods while performing on par with in-processing. In addition, we show that the proposed algorithm is particularly effective for models trained at scale where post-processing is a natural and practical choice.
What Do Neural Networks Learn When Trained With Random Labels?
Maennel, Hartmut, Alabdulmohsin, Ibrahim, Tolstikhin, Ilya, Baldock, Robert J. N., Bousquet, Olivier, Gelly, Sylvain, Keysers, Daniel
We study deep neural networks (DNNs) trained on natural image data with entirely random labels. Despite its popularity in the literature, where it is often used to study memorization, generalization, and other phenomena, little is known about what DNNs learn in this setting. In this paper, we show analytically for convolutional and fully connected networks that an alignment between the principal components of network parameters and data takes place when training with random labels. We study this alignment effect by investigating neural networks pre-trained on randomly labelled image data and subsequently fine-tuned on disjoint datasets with random or real labels. We show how this alignment produces a positive transfer: networks pre-trained with random labels train faster downstream compared to training from scratch even after accounting for simple effects, such as weight scaling. We analyze how competing effects, such as specialization at later layers, may hide the positive transfer. These effects are studied in several network architectures, including VGG16 and ResNet18, on CIFAR10 and ImageNet.
Fair Classification via Unconstrained Optimization
Alabdulmohsin, Ibrahim
Achieving the Bayes optimal binary classification rule subject to group fairness constraints is known to be reducible, in some cases, to learning a group-wise thresholding rule over the Bayes regressor. In this paper, we extend this result by proving that, in a broader setting, the Bayes optimal fair learning rule remains a group-wise thresholding rule over the Bayes regressor but with a (possible) randomization at the thresholds. This provides a stronger justification to the post-processing approach in fair classification, in which (1) a predictor is learned first, after which (2) its output is adjusted to remove bias. We show how the post-processing rule in this two-stage approach can be learned quite efficiently by solving an unconstrained optimization problem. The proposed algorithm can be applied to any black-box machine learning model, such as deep neural networks, random forests and support vector machines. In addition, it can accommodate many fairness criteria that have been previously proposed in the literature, such as equalized odds and statistical parity. We prove that the algorithm is Bayes consistent and motivate it, furthermore, via an impossibility result that quantifies the tradeoff between accuracy and fairness across multiple demographic groups. Finally, we conclude by validating the algorithm on the Adult benchmark dataset.
Uniform Generalization, Concentration, and Adaptive Learning
Alabdulmohsin, Ibrahim
One fundamental goal in any learning algorithm is to mitigate its risk for overfitting. Mathematically, this requires that the learning algorithm enjoys a small generalization risk, which is defined either in expectation or in probability. Both types of generalization are commonly used in the literature. For instance, generalization in expectation has been used to analyze algorithms, such as ridge regression and SGD, whereas generalization in probability is used in the VC theory, among others. Recently, a third notion of generalization has been studied, called uniform generalization, which requires that the generalization risk vanishes uniformly in expectation across all bounded parametric losses. It has been shown that uniform generalization is, in fact, equivalent to an information-theoretic stability constraint, and that it recovers classical results in learning theory. It is achievable under various settings, such as sample compression schemes, finite hypothesis spaces, finite domains, and differential privacy. However, the relationship between uniform generalization and concentration remained unknown. In this paper, we answer this question by proving that, while a generalization in expectation does not imply a generalization in probability, a uniform generalization in expectation does imply concentration. We establish a chain rule for the uniform generalization risk of the composition of hypotheses and use it to derive a large deviation bound. Finally, we prove that the bound is tight.
Efficient Active Learning of Halfspaces via Query Synthesis
Alabdulmohsin, Ibrahim (King Abdullah University of Science and Technology) | Gao, Xin (King Abdullah University of Science and Technology) | Zhang, Xiangliang (King Abdullah University of Science and Technology)
Active learning is a subfield of machine learning that has been successfully used in many applications including text classification and bioinformatics. One of the fundamental branches of active learning is query synthesis, where the learning agent constructs artificial queries from scratch in order to reveal sensitive information about the true decision boundary. Nevertheless, the existing literature on membership query synthesis has focused on finite concept classes with a limited extension to real-world applications. In this paper, we present an efficient spectral algorithm for membership query synthesis for halfspaces, whose sample complexity is experimentally shown to be near-optimal. At each iteration, the algorithm consists of two steps. First, a convex optimization problem is solved that provides an approximate characterization of the version space. Second, a principal component is extracted, which yields a synthetic query that shrinks the version space exponentially fast. Unlike traditional methods in active learning, the proposed method can be readily extended into the batch setting by solving for the top k eigenvectors in the second step. Experimentally, it exhibits a significant improvement over traditional approaches such as uncertainty sampling and representative sampling. For example, to learn a halfspace in the Euclidean plane with 25 dimensions and an estimation error of 1E-4, the proposed algorithm uses less than 3% of the number of queries required by uncertainty sampling.