AITopics

2212.08013

Country:

North America > United States (0.28)
North America > Canada (0.28)

Genre: Research Report > New Finding (0.45)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Artificial IntelligenceFeb-10-2023

Scaling Vision Transformers to 22 Billion Parameters

Dehghani, Mostafa, Djolonga, Josip, Mustafa, Basil, Padlewski, Piotr, Heek, Jonathan, Gilmer, Justin, Steiner, Andreas, Caron, Mathilde, Geirhos, Robert, Alabdulmohsin, Ibrahim, Jenatton, Rodolphe, Beyer, Lucas, Tschannen, Michael, Arnab, Anurag, Wang, Xiao, Riquelme, Carlos, Minderer, Matthias, Puigcerver, Joan, Evci, Utku, Kumar, Manoj, van Steenkiste, Sjoerd, Elsayed, Gamaleldin F., Mahendran, Aravindh, Yu, Fisher, Oliver, Avital, Huot, Fantine, Bastings, Jasmijn, Collier, Mark Patrick, Gritsenko, Alexey, Birodkar, Vighnesh, Vasconcelos, Cristina, Tay, Yi, Mensink, Thomas, Kolesnikov, Alexander, Pavetić, Filip, Tran, Dustin, Kipf, Thomas, Lučić, Mario, Zhai, Xiaohua, Keysers, Daniel, Harmsen, Jeremiah, Houlsby, Neil

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.

artificial intelligence, machine learning, natural language, (19 more...)

2302.05442

Country: North America (0.46)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area (0.68)
Transportation > Ground > Road (0.67)
Automobiles & Trucks (0.67)
Transportation > Passenger (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceDec-21-2022

Adapting to Latent Subgroup Shifts via Concepts and Proxies

Alabdulmohsin, Ibrahim, Chiou, Nicole, D'Amour, Alexander, Gretton, Arthur, Koyejo, Sanmi, Kusner, Matt J., Pfohl, Stephen R., Salaudeen, Olawale, Schrouff, Jessica, Tsai, Katherine

We address the problem of unsupervised domain adaptation when the source domain differs from the target domain because of a shift in the distribution of a latent subgroup. When this subgroup confounds all observed data, neither covariate shift nor label shift assumptions apply. We show that the optimal target predictor can be non-parametrically identified with the help of concept and proxy variables available only in the source domain, and unlabeled data from the target. The identification results are constructive, immediately suggesting an algorithm for estimating the optimal predictor in the target. For continuous observations, when this algorithm becomes impractical, we propose a latent variable model specific to the data generation process at hand. We show how the approach degrades as the size of the shift changes, and verify that it outperforms both covariate and label shift adjustment.

adaptation, artificial intelligence, machine learning, (15 more...)

2212.11254

Country: North America (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine > Epidemiology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.66)

arXiv.org Artificial IntelligenceNov-18-2022

Layer-Stack Temperature Scaling

Khalifa, Amr, Mozer, Michael C., Sedghi, Hanie, Neyshabur, Behnam, Alabdulmohsin, Ibrahim

Recent works demonstrate that early layers in a neural network contain useful information for prediction. Inspired by this, we show that extending temperature scaling across all layers improves both calibration and accuracy. We call this procedure "layer-stack temperature scaling" (LATES). Informally, LATES grants each layer a weighted vote during inference. We evaluate it on five popular convolutional neural network architectures both in- and out-of-distribution and observe a consistent improvement over temperature scaling in terms of accuracy, calibration, and AUC. All conclusions are supported by comprehensive statistical analyses. Since LATES neither retrains the architecture nor introduces many more parameters, its advantages can be reaped without requiring additional data beyond what is used in temperature scaling. Finally, we show that combining LATES with Monte Carlo Dropout matches state-of-the-art results on CIFAR10/100.

artificial intelligence, machine learning, neural network, (17 more...)

2211.10193

Genre: Research Report > Experimental Study (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Machine LearningFeb-2-2022

Maintaining fairness across distribution shift: do we have viable solutions for real-world applications?

Schrouff, Jessica, Harris, Natalie, Koyejo, Oluwasanmi, Alabdulmohsin, Ibrahim, Schnider, Eva, Opsahl-Ong, Krista, Brown, Alex, Roy, Subhrajit, Mincu, Diana, Chen, Christina, Dieng, Awa, Liu, Yuan, Natarajan, Vivek, Karthikesalingam, Alan, Heller, Katherine, Chiappa, Silvia, D'Amour, Alexander

Fairness and robustness are often considered as orthogonal dimensions when evaluating machine learning models. However, recent work has revealed interactions between fairness and robustness, showing that fairness properties are not necessarily maintained under distribution shift. In healthcare settings, this can result in e.g. a model that performs fairly according to a selected metric in "hospital A" showing unfairness when deployed in "hospital B". While a nascent field has emerged to develop provable fair and robust models, it typically relies on strong assumptions about the shift, limiting its impact for real-world applications. In this work, we explore the settings in which recently proposed mitigation strategies are applicable by referring to a causal framing. Using examples of predictive models in dermatology and electronic health records, we show that real-world applications are complex and often invalidate the assumptions of such methods. Our work hence highlights technical, practical, and engineering gaps that prevent the development of robustly fair machine learning models for real-world applications. Finally, we discuss potential remedies at each step of the machine learning pipeline.

artificial intelligence, distribution shift, machine learning, (22 more...)

2202.01034

Country:

North America > United States > California (0.14)
Europe > United Kingdom > England (0.14)
North America > United States > Michigan (0.14)
(2 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (1.00)

Industry:

Health & Medicine > Therapeutic Area > Dermatology (1.00)
Health & Medicine > Health Care Providers & Services (1.00)
Health & Medicine > Diagnostic Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Machine LearningJun-6-2021

A Near-Optimal Algorithm for Debiasing Trained Machine Learning Models

Alabdulmohsin, Ibrahim, Lucic, Mario

We present a scalable post-processing algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. We empirically validate its advantages on standard benchmark datasets across both classical algorithms as well as modern DNN architectures and demonstrate that it outperforms previous post-processing methods while performing on par with in-processing. In addition, we show that the proposed algorithm is particularly effective for models trained at scale where post-processing is a natural and practical choice.

artificial intelligence, machine learning model, neural network, (4 more...)

2106.12887

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.53)

arXiv.org Machine LearningNov-11-2020

What Do Neural Networks Learn When Trained With Random Labels?

Maennel, Hartmut, Alabdulmohsin, Ibrahim, Tolstikhin, Ilya, Baldock, Robert J. N., Bousquet, Olivier, Gelly, Sylvain, Keysers, Daniel

We study deep neural networks (DNNs) trained on natural image data with entirely random labels. Despite its popularity in the literature, where it is often used to study memorization, generalization, and other phenomena, little is known about what DNNs learn in this setting. In this paper, we show analytically for convolutional and fully connected networks that an alignment between the principal components of network parameters and data takes place when training with random labels. We study this alignment effect by investigating neural networks pre-trained on randomly labelled image data and subsequently fine-tuned on disjoint datasets with random or real labels. We show how this alignment produces a positive transfer: networks pre-trained with random labels train faster downstream compared to training from scratch even after accounting for simple effects, such as weight scaling. We analyze how competing effects, such as specialization at later layers, may hide the positive transfer. These effects are studied in several network architectures, including VGG16 and ResNet18, on CIFAR10 and ImageNet.

deep learning, neural network, random label, (20 more...)

2006.10455

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report > New Finding (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

arXiv.org Artificial IntelligenceMay-21-2020

Fair Classification via Unconstrained Optimization

Alabdulmohsin, Ibrahim

Achieving the Bayes optimal binary classification rule subject to group fairness constraints is known to be reducible, in some cases, to learning a group-wise thresholding rule over the Bayes regressor. In this paper, we extend this result by proving that, in a broader setting, the Bayes optimal fair learning rule remains a group-wise thresholding rule over the Bayes regressor but with a (possible) randomization at the thresholds. This provides a stronger justification to the post-processing approach in fair classification, in which (1) a predictor is learned first, after which (2) its output is adjusted to remove bias. We show how the post-processing rule in this two-stage approach can be learned quite efficiently by solving an unconstrained optimization problem. The proposed algorithm can be applied to any black-box machine learning model, such as deep neural networks, random forests and support vector machines. In addition, it can accommodate many fairness criteria that have been previously proposed in the literature, such as equalized odds and statistical parity. We prove that the algorithm is Bayes consistent and motivate it, furthermore, via an impossibility result that quantifies the tradeoff between accuracy and fairness across multiple demographic groups. Finally, we conclude by validating the algorithm on the Adult benchmark dataset.

algorithm, deep learning, neural network, (19 more...)

2005.14621

Country:

North America > United States (1.00)
Europe (0.93)

Genre: Research Report (0.64)

Industry:

Law (0.46)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

arXiv.org Machine LearningOct-3-2016

Uniform Generalization, Concentration, and Adaptive Learning

Alabdulmohsin, Ibrahim

One fundamental goal in any learning algorithm is to mitigate its risk for overfitting. Mathematically, this requires that the learning algorithm enjoys a small generalization risk, which is defined either in expectation or in probability. Both types of generalization are commonly used in the literature. For instance, generalization in expectation has been used to analyze algorithms, such as ridge regression and SGD, whereas generalization in probability is used in the VC theory, among others. Recently, a third notion of generalization has been studied, called uniform generalization, which requires that the generalization risk vanishes uniformly in expectation across all bounded parametric losses. It has been shown that uniform generalization is, in fact, equivalent to an information-theoretic stability constraint, and that it recovers classical results in learning theory. It is achievable under various settings, such as sample compression schemes, finite hypothesis spaces, finite domains, and differential privacy. However, the relationship between uniform generalization and concentration remained unknown. In this paper, we answer this question by proving that, while a generalization in expectation does not imply a generalization in probability, a uniform generalization in expectation does imply concentration. We establish a chain rule for the uniform generalization risk of the composition of hypotheses and use it to derive a large deviation bound. Finally, we prove that the bound is tight.

computer based training, educational technology, generalization, (19 more...)

1608.06072

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

AAAI ConferencesMar-6-2015

Efficient Active Learning of Halfspaces via Query Synthesis

Alabdulmohsin, Ibrahim (King Abdullah University of Science and Technology) | Gao, Xin (King Abdullah University of Science and Technology) | Zhang, Xiangliang (King Abdullah University of Science and Technology)

Active learning is a subfield of machine learning that has been successfully used in many applications including text classification and bioinformatics. One of the fundamental branches of active learning is query synthesis, where the learning agent constructs artificial queries from scratch in order to reveal sensitive information about the true decision boundary. Nevertheless, the existing literature on membership query synthesis has focused on finite concept classes with a limited extension to real-world applications. In this paper, we present an efficient spectral algorithm for membership query synthesis for halfspaces, whose sample complexity is experimentally shown to be near-optimal. At each iteration, the algorithm consists of two steps. First, a convex optimization problem is solved that provides an approximate characterization of the version space. Second, a principal component is extracted, which yields a synthetic query that shrinks the version space exponentially fast. Unlike traditional methods in active learning, the proposed method can be readily extended into the batch setting by solving for the top k eigenvectors in the second step. Experimentally, it exhibits a significant improvement over traditional approaches such as uncertainty sampling and representative sampling. For example, to learn a halfspace in the Euclidean plane with 25 dimensions and an estimation error of 1E-4, the proposed algorithm uses less than 3% of the number of queries required by uncertainty sampling.

artificial intelligence, machine learning, query, (19 more...)

AAAI Conferences

Twenty-Ninth AAAI Conference on Artificial Intelligence

Country:

North America > United States > Wisconsin (0.14)
Asia > Middle East > Saudi Arabia (0.14)

Industry: Information Technology (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)