
Collaborating Authors: Farajtabar, Mehrdad


TiC-CLIP: Continual Training of CLIP Models

arXiv.org Artificial Intelligence

Keeping large foundation models up to date on the latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps, with over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first use our benchmarks to curate various dynamic evaluations to measure the temporal robustness of existing models. We show that OpenAI's CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021--2022 compared with more recently trained models in the OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by $2.5\times$ compared with the standard practice of retraining from scratch.
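
As a concrete illustration of the rehearsal-based recipe above, here is a minimal PyTorch-style sketch, assuming a CLIP-like model exposing encode_image/encode_text and in-memory lists of (image, text) tensor pairs; the mixing ratio, loop structure, and helper names are illustrative rather than the paper's exact training setup.

import random
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feat, txt_feat, temperature=0.07):
    # Symmetric InfoNCE loss over the in-batch image-text similarity matrix.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def continual_update(model, optimizer, new_pairs, old_pairs,
                     steps=1000, batch_size=256, replay_ratio=0.5):
    """Warm-start from the latest checkpoint and mix new data with replayed old data."""
    model.train()
    n_old = int(batch_size * replay_ratio)
    for _ in range(steps):
        # Mixed batch: a fraction from the newest time bucket, the rest replayed from older buckets.
        batch = random.sample(new_pairs, batch_size - n_old) + random.sample(old_pairs, n_old)
        images = torch.stack([img for img, _ in batch])
        texts = torch.stack([txt for _, txt in batch])
        # encode_image/encode_text are assumed CLIP-style encoders.
        loss = clip_contrastive_loss(model.encode_image(images), model.encode_text(texts))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()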


CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

arXiv.org Artificial Intelligence

Contrastive language-image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? To this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated and noisy image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image and text pairs. This simple setup yields substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these enhancements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification. Foundation Models (FMs) are revolutionizing different domains of artificial intelligence and machine learning, including computer vision (Radford et al., 2021; He et al., 2022; Kirillov et al., 2023b) and natural language processing (Devlin et al., 2018; Brown et al., 2020; Touvron et al., 2023). FMs can be trained on web-crawled data without relying on crowd or expert annotations, and yet they demonstrate strong generalization capabilities (Jia et al., 2021; Schuhmann et al., 2022).
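
A rough sketch of the combined objective described above, assuming a backbone that returns pooled image/text embeddings plus dense spatial features, lightweight task heads, and pseudo-labels precomputed offline by zoo models; the names, batch keys, and loss weighting are assumptions, not the paper's exact API.

import torch
import torch.nn.functional as F

def pseudo_supervised_loss(backbone, heads, batch, weights, clip_loss_fn):
    """Contrastive CLIP loss plus auxiliary losses against zoo-model pseudo-labels."""
    # Assumed forward signature: pooled image/text embeddings + dense spatial features.
    img_feat, txt_feat, dense_feat = backbone(batch["image"], batch["text"])
    loss = clip_loss_fn(img_feat, txt_feat)

    # Segmentation pseudo-labels: per-pixel class indices from an open-source segmenter.
    seg_logits = heads["segmentation"](dense_feat)
    loss = loss + weights["segmentation"] * F.cross_entropy(seg_logits, batch["pseudo_seg"])

    # Depth pseudo-labels: per-pixel depth maps from an off-the-shelf depth estimator.
    depth_pred = heads["depth"](dense_feat)
    loss = loss + weights["depth"] * F.l1_loss(depth_pred, batch["pseudo_depth"])
    return loss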


ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

arXiv.org Artificial Intelligence

Large Language Models (LLMs) with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. Despite recent trends favoring alternative activation functions such as GELU or SiLU, which incur more computation, this study strongly advocates for reinstating the ReLU activation in LLMs. We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer. This reduction is particularly valuable during the memory-bound inference step, where efficiency is paramount. Exploring sparsity patterns in ReLU-based LLMs, we find that activated neurons are reused when generating new tokens; leveraging these insights, we propose practical strategies that reduce LLM inference computation by up to three times with minimal performance trade-offs.
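
A small self-contained sketch of the activation-sparsity idea, using a toy ReLU feed-forward block with assumed dimensions (not the paper's implementation): with ReLU, most activations are exactly zero, so only the rows of the down-projection corresponding to non-zero activations need to be read and multiplied.

import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
up = nn.Linear(d_model, d_ff)
down = nn.Linear(d_ff, d_model)

x = torch.randn(d_model)                      # one token's hidden state
h = torch.relu(up(x))                         # ReLU activations
active = h.nonzero(as_tuple=True)[0]          # indices of non-zero neurons
sparsity = 1.0 - active.numel() / d_ff
print(f"activation sparsity: {sparsity:.2%}")

# Dense and sparse-aware down projections give the same result, but the sparse
# version reads only the active columns of down.weight (less weight transfer).
dense_out = down(h)
sparse_out = h[active] @ down.weight[:, active].t() + down.bias
assert torch.allclose(dense_out, sparse_out, atol=1e-5)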


Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement

arXiv.org Artificial Intelligence

We propose Dataset Reinforcement, a strategy to improve a dataset once such that the accuracy of any model architecture trained on the reinforced dataset is improved, at no additional training cost for users. Our strategy is based on data augmentation and knowledge distillation, and is designed through extensive analysis across CNN- and transformer-based models together with a large-scale study of distillation using state-of-the-art models and various data augmentations. We create a reinforced version of the ImageNet training dataset, called ImageNet+, as well as the reinforced datasets CIFAR-100+, Flowers-102+, and Food-101+. Models trained with ImageNet+ are more accurate, robust, and calibrated, and transfer well to downstream tasks (e.g., segmentation and detection). For example, the accuracy of ResNet-50 improves by 1.7% on the ImageNet validation set, 3.5% on ImageNetV2, and 10.0% on ImageNet-R. Expected Calibration Error (ECE) on the ImageNet validation set is also reduced by 9.9%. Using this backbone with Mask-RCNN for object detection on MS-COCO, the mean average precision improves by 0.8%. We reach similar gains for MobileNets, ViTs, and Swin-Transformers. For MobileNetV3 and Swin-Tiny, we observe significant improvements on ImageNet-R/A/C of up to 20% in robustness. Models pretrained on ImageNet+ and fine-tuned on CIFAR-100+, Flowers-102+, and Food-101+ reach up to 3.4% higher accuracy. The code, datasets, and pretrained models are available at https://github.com/apple/ml-dr.
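
A minimal sketch of how a reinforced sample might be consumed during training, assuming each sample stores the teacher's precomputed soft predictions alongside its already-reproduced augmentation; the field names and the distillation loss are illustrative assumptions, not the released implementation.

import torch.nn.functional as F

def reinforced_step(student, optimizer, batch, temperature=1.0):
    """
    One training step on a reinforced dataset: each sample carries its stored
    augmentation (already applied) and the teacher's precomputed soft predictions,
    so no teacher forward pass is needed at student-training time.
    """
    logits = student(batch["image"])
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    # KL divergence to the stored teacher probabilities (knowledge distillation target).
    loss = F.kl_div(log_probs, batch["teacher_probs"], reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()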


Continual Learning Beyond a Single Model

arXiv.org Artificial Intelligence

A growing body of research in continual learning focuses on the catastrophic forgetting problem. While many attempts have been made to alleviate this problem, the majority of methods assume a single model in the continual learning setup. In this work, we question this assumption and show that employing ensemble models can be a simple yet effective way to improve continual performance. However, the training and inference costs of ensembles can increase significantly as the number of models grows. Motivated by this limitation, we study different ensemble models to understand their benefits and drawbacks in continual learning scenarios. Finally, to overcome the high compute cost of ensembles, we leverage recent advances in neural network subspaces to propose a computationally cheap algorithm with runtime similar to a single model that still enjoys the performance benefits of ensembles. Continual learning (CL) and lifelong learning (Thrun, 1994) have recently gained popularity since many real-world applications fall into this setting. It describes the scenario where not only does a stream of data arrive sequentially, but its distribution also changes over time. This setup induces Catastrophic Forgetting (CF) (McCloskey & Cohen, 1989), a degradation of performance on previous data due to distribution shift between tasks (Doan et al., 2021). One fundamental goal in continual learning is to learn from new incoming tasks while retaining knowledge from the past and avoiding interference that can lead to poor performance (Lesort et al., 2021). This becomes particularly challenging as the stream of data grows, because the entire burden falls on a single model. A simple yet effective solution is to rely on an ensemble method that improves performance over a single model. Inspired by bootstrapping (Breiman, 1996), deep ensembles initialize and train multiple neural networks independently (Lakshminarayanan et al., 2017; Fort et al., 2019).
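
The two ingredients discussed above can be sketched as follows: a plain deep-ensemble prediction, and a crude weight-space interpolation standing in for the subspace idea. The paper's actual subspace construction is more involved, and the function names here are hypothetical.

import copy
import torch

def ensemble_predict(models, x):
    """Average the members' softmax outputs (a standard deep-ensemble prediction)."""
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models])
    return probs.mean(dim=0)

def interpolate_weights(model_a, model_b, alpha=0.5):
    """One point on the line segment between two solutions in weight space:
    a cheap stand-in for sampling from a learned subspace."""
    sampled = copy.deepcopy(model_a)
    with torch.no_grad():
        for p_s, p_a, p_b in zip(sampled.parameters(), model_a.parameters(), model_b.parameters()):
            p_s.copy_((1 - alpha) * p_a + alpha * p_b)
    return sampled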


Architecture Matters in Continual Learning

arXiv.org Artificial Intelligence

Continual learning (CL) (Ring, 1995; Thrun, 1995) is a branch of machine learning where the model is exposed to a sequence of tasks with the hope of exploiting existing knowledge to adapt quickly to new tasks. Research in continual learning has seen a surge in the past few years, with an explicit focus on developing algorithms that can alleviate catastrophic forgetting (McCloskey and Cohen, 1989), whereby the model abruptly forgets information from the past when trained on new tasks. While most research in continual learning focuses on developing learning algorithms that can outperform naive fine-tuning on a stream of data, the role of model architecture, to the best of our knowledge, is not explicitly studied in any of the existing works. Even the class of parameter isolation or expansion-based methods, for example (Rusu et al., 2016; Yoon et al., 2018), has only a cursory focus on the model architecture, insofar as these methods assume a specific architecture and try to find an algorithm operating on it. Orthogonal to this direction of designing algorithms, our motivation is that the inductive biases induced by different architectural components are important for continual learning, and we seek to characterize the implications of different architectural choices. To motivate this, consider a ResNet-18 model (He et al., 2016) on Split CIFAR-100, where the CIFAR-100 dataset (Krizhevsky et al., 2009) is split into 20 disjoint sets, a prevalent architecture and benchmark in the existing continual learning literature. Figure 1a shows that explicitly designed CL algorithms, EWC (Kirkpatrick et al., 2017) (a parameter regularization-based method) and experience replay (Riemer et al., ...)
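
For reference, the Split CIFAR-100 benchmark mentioned above can be constructed roughly as follows: the 100 classes are divided into 20 disjoint 5-class tasks. This torchvision-based sketch is illustrative, not the paper's exact data pipeline.

import numpy as np
from torchvision.datasets import CIFAR100

dataset = CIFAR100(root="./data", train=True, download=True)
targets = np.array(dataset.targets)

num_tasks = 20
classes_per_task = 100 // num_tasks
task_indices = []
for t in range(num_tasks):
    # Task t owns a disjoint block of 5 consecutive class labels.
    task_classes = range(t * classes_per_task, (t + 1) * classes_per_task)
    idx = np.where(np.isin(targets, list(task_classes)))[0]
    task_indices.append(idx)   # sample indices belonging to task t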


Wide Neural Networks Forget Less Catastrophically

arXiv.org Artificial Intelligence

Machine learning relies more and more on training large models on large static datasets to reach impressive results (Kaplan et al., 2020; Lazaridou et al., 2021; Hombaiah et al., 2021). However, the real world changes over time, and new information becomes available at an unprecedented rate (Lazaridou et al., 2021; Hombaiah et al., 2021). In such real-world problems, the learning agent is exposed to a continuous stream of data with a potentially changing distribution, and it has to absorb new information efficiently while not being able to iterate over previous data as freely as desired due to time, sample, compute, privacy, or environmental complexity constraints (Parisi et al., 2018). To address these limitations, fields such as continual learning (CL) (Ring et al., 1994) and lifelong learning (Thrun, 1995) have recently been gaining a lot of attention. One of the key challenges for continual learning models is the abrupt erasure of previous knowledge, referred to as Catastrophic Forgetting (CF) (McCloskey and Cohen, 1989). Alleviating catastrophic forgetting has attracted a lot of attention lately, and many interesting solutions have been proposed to partly overcome the issue (e.g., Toneva et al., 2018; Nguyen et al., 2019; Hsu et al., 2018; Li et al., 2019; Wallingford et al., 2020). These solutions vary in complexity, from simple replay-based methods to more involved regularization or network expansion-based methods. Unfortunately, however, there is not much fundamental understanding of the intrinsic properties of neural networks that affect continual learning performance through catastrophic forgetting or forward/backward transfer (Mirzadeh et al., 2020).
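
To make the notion of forgetting quantitative, the average-forgetting metric commonly used in the continual learning literature (a standard definition, not specific to this paper) can be computed from the task-accuracy matrix as sketched below.

import numpy as np

def average_forgetting(acc: np.ndarray) -> float:
    """acc[i, j] is the accuracy on task j measured after training on task i."""
    T = acc.shape[0]
    # For each earlier task, compare its best accuracy observed during training
    # with its accuracy after the final task, and average the drops.
    drops = [acc[:T - 1, j].max() - acc[T - 1, j] for j in range(T - 1)]
    return float(np.mean(drops))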


Task-agnostic Continual Learning with Hybrid Probabilistic Models

arXiv.org Machine Learning

Learning new tasks continuously without forgetting on a constantly changing data distribution is essential for real-world problems but extremely challenging for modern deep learning. In this work we propose HCL, a Hybrid generative-discriminative approach to Continual Learning for classification. We model the distribution of each task and each class with a normalizing flow. The flow is used to learn the data distribution, perform classification, identify task changes, and avoid forgetting, all leveraging the invertibility and exact likelihood which are uniquely enabled by the normalizing flow model. We use the generative capabilities of the flow to avoid catastrophic forgetting through generative replay and a novel functional regularization technique. For task identification, we use state-of-the-art anomaly detection techniques based on measuring the typicality of the model's statistics. We demonstrate the strong performance of HCL on a range of continual learning benchmarks such as split-MNIST, split-CIFAR, and SVHN-MNIST.
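
A minimal sketch of the generative classification rule described above, assuming one flow per class exposing a log_prob interface as in common normalizing-flow libraries; HCL's full model additionally handles task identification, generative replay, and functional regularization.

import torch

def classify_with_flows(flows, x):
    """Predict the class whose flow assigns the input the highest exact log-likelihood."""
    log_likes = torch.stack([flow.log_prob(x) for flow in flows], dim=-1)  # (batch, num_classes)
    return log_likes.argmax(dim=-1)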


Balance Regularized Neural Network Models for Causal Effect Estimation

arXiv.org Machine Learning

Estimating individual and average treatment effects from observational data is an important problem in many domains such as healthcare and e-commerce. In this paper, we advocate balance regularization of multi-head neural network architectures. Our work is motivated by representation learning techniques to reduce differences between treated and untreated distributions that potentially arise due to confounding factors. We further regularize the model by encouraging it to predict control outcomes for individuals in the treatment group that are similar to control outcomes in the control group. We empirically study the bias-variance trade-off between different weightings of the regularizers, as well as between inductive and transductive inference.
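
A rough sketch of the loss structure described above, with a shared representation, treated/control outcome heads, a simple mean-matching balance term, and a counterfactual-consistency term; the names and the exact form of the regularizers are assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

def causal_loss(phi, head_t, head_c, x, t, y, lam_balance=1.0, lam_cf=1.0):
    rep = phi(x)                                   # shared representation
    treated = t.bool()
    y_hat = torch.where(treated, head_t(rep).squeeze(-1), head_c(rep).squeeze(-1))
    factual = F.mse_loss(y_hat, y)                 # fit observed (factual) outcomes

    # Balance term: shrink the gap between treated and control representation means.
    balance = (rep[treated].mean(0) - rep[~treated].mean(0)).pow(2).sum()

    # Counterfactual consistency: predicted control outcomes for treated individuals
    # should resemble observed control outcomes in the control group (mean matching here).
    cf = (head_c(rep[treated]).mean() - y[~treated].mean()).pow(2)

    return factual + lam_balance * balance + lam_cf * cf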


Optimization and Generalization of Regularization-Based Continual Learning: a Loss Approximation Viewpoint

arXiv.org Machine Learning

Neural networks have achieved remarkable success in many cognitive tasks. However, when they are trained sequentially on multiple tasks without access to old data, their performance on early tasks tends to drop significantly. This problem is often referred to as catastrophic forgetting, a key challenge in continual learning of neural networks. The regularization-based approach is one of the primary classes of methods for alleviating catastrophic forgetting. In this paper, we provide a novel viewpoint on regularization-based continual learning by formulating it as a second-order Taylor approximation of the loss function of each task. This viewpoint leads to a unified framework that can be instantiated to derive many existing algorithms, such as Elastic Weight Consolidation and the Kronecker-factored Laplace approximation. Based on this viewpoint, we study the optimization aspects (i.e., convergence) as well as the generalization properties (i.e., finite-sample guarantees) of regularization-based continual learning. Our theoretical results indicate the importance of an accurate approximation of the Hessian matrix. Experimental results on several benchmarks provide empirical validation of our theoretical findings.
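
As a concrete instance of this viewpoint (notation ours, not quoted from the paper), expanding an earlier task's loss $\mathcal{L}_k$ to second order around its approximate minimizer $\theta_k^*$ gives

$$\mathcal{L}_k(\theta) \;\approx\; \mathcal{L}_k(\theta_k^*) + \tfrac{1}{2}\,(\theta - \theta_k^*)^\top H_k\,(\theta - \theta_k^*),$$

since the gradient term vanishes at the minimizer. Training task $t$ with these quadratic penalties then amounts to

$$\min_\theta \; \mathcal{L}_t(\theta) + \frac{\lambda}{2} \sum_{k < t} (\theta - \theta_k^*)^\top H_k\,(\theta - \theta_k^*),$$

which recovers Elastic Weight Consolidation when each $H_k$ is approximated by a diagonal Fisher information matrix, and the Kronecker-factored Laplace approximation when a Kronecker-factored curvature estimate is used instead.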