Webb, Russ
Distillation Scaling Laws
Busbridge, Dan, Shidani, Amitis, Weers, Floris, Ramapuram, Jason, Littwin, Etai, Webb, Russ
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute-optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level that grows predictably with student size. If only one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights from our large-scale study of distillation, which increase our understanding of distillation and inform experimental design.
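For orientation, the sketch below shows a standard soft-target distillation objective, the kind of student/teacher setup whose compute allocation the scaling law describes; the temperature and mixing weight are illustrative assumptions, not the paper's recipe.

```python
# Minimal knowledge-distillation loss (sketch, not the paper's specific recipe).
# Assumptions: temperature T and mixing weight alpha are illustrative choices.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft KL term against the teacher with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale so gradient magnitude is comparable across T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```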
Poly-View Contrastive Learning
Shidani, Amitis, Hjelm, Devon, Ramapuram, Jason, Webb, Russ, Dhekane, Eeshan Gunesh, Busbridge, Dan
Contrastive learning typically matches pairs of related views among a number of unrelated negative views. Views can be generated (e.g., by augmentations) or be observed. We investigate matching when there are more than two related views, which we call poly-view tasks, and derive new representation learning objectives using information maximization and sufficient statistics. We show that with unlimited computation, one should maximize the number of related views, and with a fixed compute budget, it is beneficial to decrease the number of unique samples whilst increasing the number of views of those samples. In particular, poly-view contrastive models trained for 128 epochs with batch size 256 outperform SimCLR trained for 1024 epochs at batch size 4096 on ImageNet1k, challenging the belief that contrastive models require large batch sizes and many training epochs.
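To make the poly-view setting concrete, here is a sketch of a multi-positive contrastive loss over N samples with M related views each; the aggregation over positives (averaging log-probabilities) is an illustrative choice and not necessarily the objective derived in the paper.

```python
# Multi-positive contrastive loss (sketch): each view's positives are the other
# M-1 views of the same sample; all views of other samples act as negatives.
import torch

def multi_positive_info_nce(z, temperature=0.1):
    """z: (N, M, D) L2-normalised embeddings, N samples with M related views each."""
    N, M, D = z.shape
    flat = z.reshape(N * M, D)
    sim = flat @ flat.t() / temperature                              # pairwise similarities
    labels = torch.arange(N, device=z.device).repeat_interleave(M)   # sample id per view
    same_sample = labels[:, None] == labels[None, :]
    self_mask = torch.eye(N * M, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))                  # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = same_sample & ~self_mask                              # the other M-1 views
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_mask.sum(1)
    return loss.mean()
```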
Bootstrap Your Own Variance
Turishcheva, Polina, Ramapuram, Jason, Williamson, Sinead, Busbridge, Dan, Dhekane, Eeshan, Webb, Russ
Understanding model uncertainty is important for many applications. We propose Bootstrap Your Own Variance (BYOV), combining Bootstrap Your Own Latent (BYOL), a negative-free Self-Supervised Learning (SSL) algorithm, with Bayes by Backprop (BBB), a Bayesian method for estimating model posteriors. We find that the learned predictive standard deviation of BYOV vs. a supervised BBB model is well captured by a Gaussian distribution, providing preliminary evidence that the learned parameter posterior is useful for label-free uncertainty estimation. BYOV improves upon the deterministic BYOL baseline (+2.83% test ECE, +1.03% test Brier) and presents better calibration and reliability when tested with various augmentations (e.g., +2.4% test ECE, +1.2% test Brier for Salt & Pepper noise).
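As a reference point for the BBB ingredient, the sketch below shows a Bayes-by-Backprop style linear layer with a learned Gaussian weight posterior; how such layers are combined with the BYOL backbone in BYOV (and the KL term against the prior) is not shown here.

```python
# Bayes-by-Backprop style layer (sketch): Gaussian weights with learned mean and
# std, sampled via the reparameterisation trick. The KL term to the prior that
# completes the BBB objective is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -5.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -5.0))

    def forward(self, x):
        w_std = F.softplus(self.w_rho)                       # keep std positive
        b_std = F.softplus(self.b_rho)
        w = self.w_mu + w_std * torch.randn_like(w_std)      # reparameterisation trick
        b = self.b_mu + b_std * torch.randn_like(b_std)
        return F.linear(x, w, b)
```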
How to Scale Your EMA
Busbridge, Dan, Ramapuram, Jason, Ablin, Pierre, Likhomanenko, Tatiana, Dhekane, Eeshan Gunesh, Suau, Xavier, Webb, Russ
Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule; for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6$\times$ wall-clock time reduction under idealized hardware settings.
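A minimal sketch of the two pieces involved: the EMA update itself, and an adjustment of the momentum when the batch size is scaled by a factor kappa. The exponentiation form rho**kappa is our reading of the EMA scaling rule, and the concrete numbers below (a base momentum of 0.996 at batch size 4096) are illustrative assumptions.

```python
# EMA update plus momentum adjustment under batch-size scaling (sketch).
import torch

@torch.no_grad()
def ema_update(ema_model, model, rho):
    """Move EMA parameters towards the target model at momentum rho."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(rho).add_(p, alpha=1.0 - rho)

def scale_ema_momentum(rho_base, kappa):
    """Momentum to use when the batch size is scaled by a factor kappa (rho**kappa)."""
    return rho_base ** kappa

# Illustrative example: base momentum 0.996 at batch size 4096, scaled to 24,576.
rho_scaled = scale_ema_momentum(0.996, kappa=24576 / 4096)   # approximately 0.976
```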
Elastic Weight Consolidation Improves the Robustness of Self-Supervised Learning Methods under Transfer
Ovsianas, Andrius, Ramapuram, Jason, Busbridge, Dan, Dhekane, Eeshan Gunesh, Webb, Russ
Self-supervised representation learning (SSL) methods provide an effective label-free initial condition for fine-tuning downstream tasks. However, in numerous realistic scenarios, the downstream task might be biased with respect to the target label distribution. This in turn moves the learned fine-tuned model posterior away from the initial (label) bias-free self-supervised model posterior. In this work, we re-interpret SSL fine-tuning under the lens of Bayesian continual learning and consider regularization through the Elastic Weight Consolidation (EWC) framework. We demonstrate that self-regularization against an initial SSL backbone improves worst sub-group performance on Waterbirds by 5% and on Celeb-A by 2% when using the ViT-B/16 architecture. Furthermore, to help simplify the use of EWC with SSL, we pre-compute and publicly release the Fisher Information Matrix (FIM), computed from 10,000 ImageNet-1K variates, for large modern SSL architectures including ViT-B/16 and ResNet50 trained with DINO.
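The self-regularization term has the standard EWC form: a Fisher-weighted quadratic penalty pulling fine-tuned weights back towards the SSL initialization. A minimal sketch, where the regularization strength lam and the dict layout of the released FIM are illustrative assumptions:

```python
# EWC-style self-regularisation against an SSL initialisation (sketch).
# fisher: dict of per-parameter Fisher Information estimates;
# ssl_params: dict of the frozen SSL backbone parameters; lam is illustrative.
import torch

def ewc_penalty(model, ssl_params, fisher, lam=1.0):
    """Quadratic penalty pulling fine-tuned weights towards the SSL backbone."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - ssl_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# total_loss = task_loss + ewc_penalty(model, ssl_params, fisher, lam=...)
```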
Do Self-Supervised and Supervised Methods Learn Similar Visual Representations?
Grigg, Tom George, Busbridge, Dan, Ramapuram, Jason, Webb, Russ
Despite the success of a number of recent techniques for visual self-supervised deep learning, there remains limited investigation into the representations that are ultimately learned. By using recent advances in comparing neural representations, we explore this direction by comparing a contrastive self-supervised algorithm (SimCLR) to supervision for simple image data in a common architecture. We find that the methods learn similar intermediate representations through dissimilar means, and that the representations diverge rapidly in the final few layers. We investigate this divergence, finding that it is caused by these layers strongly fitting to the distinct learning objectives. We also find that SimCLR's objective implicitly fits the supervised objective in intermediate layers, but that the reverse is not true. Our work particularly highlights the importance of the learned intermediate representations, and raises important questions for auxiliary task design.
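One common tool for such layer-wise comparisons is linear centered kernel alignment (CKA); whether this exact measure matches the one used in the paper is an assumption, but it illustrates how two sets of activations for the same inputs can be compared.

```python
# Linear CKA between two layers' activations for the same n inputs (sketch).
import torch

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) activations; returns a similarity in [0, 1]."""
    X = X - X.mean(dim=0, keepdim=True)     # centre features
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.t() @ X).norm(p="fro") ** 2
    return hsic / ((X.t() @ X).norm(p="fro") * (Y.t() @ Y).norm(p="fro"))
```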
Differentiable Approximation Bridges For Training Networks Containing Non-Differentiable Functions
Ramapuram, Jason, Webb, Russ
Modern neural network training relies on piece-wise (sub-)differentiable functions in order to use backpropagation for efficient calculation of gradients. In this work, we introduce a novel method to allow for non-differentiable functions at intermediary layers of deep neural networks. We do so through the introduction of a differentiable approximation bridge (DAB) neural network which provides smooth approximations to the gradient of the non-differentiable function. We present strong empirical results (performing over 600 experiments) in three different domains: unsupervised (image) representation learning, image classification, and sequence sorting to demonstrate that our proposed method improves state-of-the-art performance. We demonstrate that utilizing non-differentiable functions in unsupervised (image) representation learning improves reconstruction quality and posterior linear separability by 10%. We also observe an accuracy improvement of 77% in neural sequence sorting and a 25% improvement against the straight-through estimator [3] in an image classification setting with the sort non-linearity.  This work enables the usage of functions that were previously not usable in neural networks.
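The sketch below conveys the bridging idea in rough terms: the hard function is used in the forward pass while gradients flow through a small learned surrogate that is trained to track it. This is a loose illustration of the concept, not the DAB architecture or training procedure from the paper.

```python
# Bridged non-differentiable op (sketch): forward uses the hard function,
# backward uses the gradient of a learned smooth surrogate, plus an auxiliary
# regression loss that keeps the surrogate close to the hard function.
import torch
import torch.nn as nn

class BridgedSort(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.surrogate = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, x):
        hard = torch.sort(x, dim=-1).values       # treated as the hard, non-smooth op
        soft = self.surrogate(x)                  # smooth learned approximation
        out = soft + (hard - soft).detach()       # forward = hard, gradient = surrogate
        self.aux_loss = torch.mean((soft - hard.detach()) ** 2)
        return out
```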
Mirroring to Build Trust in Digital Assistants
Metcalf, Katherine, Theobald, Barry-John, Weinberg, Garrett, Lee, Robert, Jonsson, Ing-Marie, Webb, Russ, Apostoloff, Nicholas
We describe experiments towards building a conversational digital assistant that considers the preferred conversational style of the user. In particular, these experiments are designed to measure whether users prefer and trust an assistant whose conversational style matches their own. To this end we conducted a user study where subjects interacted with a digital assistant that responded in a way that either matched their conversational style, or did not. Using self-reported personality attributes and subjects' feedback on the interactions, we built models that can reliably predict a user's preferred conversational style.
Variational Saccading: Efficient Inference for Large Resolution Images
Ramapuram, Jason, Diephuis, Maurits, Webb, Russ, Kalousis, Alexandros
Image classification with deep neural networks is typically restricted to images of small dimensionality such as 224x224 in ResNet models. This limitation excludes the 4000x3000 dimensional images that are taken by modern smartphone cameras and smart devices. In this work, we aim to mitigate the prohibitive inferential and memory costs of operating in such large dimensional spaces. To sample from the high-resolution original input distribution, we propose using a smaller proxy distribution to learn the co-ordinates that correspond to regions of interest in the high-dimensional space. We introduce a new principled variational lower bound that captures the relationship of the proxy distribution's posterior and the original image's co-ordinate space in a way that maximizes the conditional classification likelihood. We empirically demonstrate on one synthetic benchmark and one real world large resolution DSLR camera image dataset that our method produces comparable results with 10x faster inference and lower memory consumption than a model that utilizes the entire original input distribution.
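A rough sketch of the pipeline shape being described: a small network over a low-resolution proxy predicts crop co-ordinates, and only that glimpse of the full-resolution image is processed. The module names, sizes, and the hard (non-differentiable) crop are illustrative assumptions; the paper's variational lower bound, which ties the proxy posterior to the co-ordinate space, is precisely what this naive version lacks.

```python
# Glimpse-style pipeline (sketch): proxy image -> crop co-ordinates -> classify crop.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaccadeSketch(nn.Module):
    def __init__(self, proxy_size=64, glimpse=224, num_classes=10):
        super().__init__()
        self.proxy_size, self.glimpse = proxy_size, glimpse
        self.locator = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * proxy_size * proxy_size, 2), nn.Sigmoid())
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * glimpse * glimpse, num_classes))

    def forward(self, full_res):                           # full_res: (B, 3, H, W)
        B, _, H, W = full_res.shape
        proxy = F.interpolate(full_res, size=self.proxy_size)   # cheap low-res view
        centre = self.locator(proxy)                       # (B, 2) in [0, 1]
        crops = []
        for b in range(B):
            cy = int(centre[b, 0].item() * (H - self.glimpse))
            cx = int(centre[b, 1].item() * (W - self.glimpse))
            # NOTE: this hard crop passes no gradient to the co-ordinates.
            crops.append(full_res[b:b + 1, :, cy:cy + self.glimpse, cx:cx + self.glimpse])
        return self.classifier(torch.cat(crops, dim=0))
```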
A New Benchmark and Progress Toward Improved Weakly Supervised Learning
Ramapuram, Jason, Webb, Russ
Knowledge Matters: Importance of Prior Information for Optimization [7], by Gulcehre et al., sought to establish the limits of current black-box, deep learning techniques by posing problems which are difficult to learn without engineering knowledge into the model or training procedure. In our work, we completely solve the previous Knowledge Matters problem using a generic model, pose a more difficult and scalable problem, All-Pairs, and advance this new problem by introducing a new learned, spatially-varying histogram model called TypeNet, which outperforms conventional models on the problem. We present results on All-Pairs where our model achieves 100% test accuracy while the best ResNet models achieve 79% accuracy. In addition, our model is more than an order of magnitude smaller than ResNet-34. The challenge of solving larger-scale All-Pairs problems with high accuracy is presented to the community for investigation.