
Collaborating Authors

 Vani, Ankit


Forget Sharpness: Perturbed Forgetting of Model Biases Within SAM Dynamics

arXiv.org Artificial Intelligence

Although models trained with sharpness-aware minimization (SAM) attain high empirical generalization, their sharpness does not always correlate with generalization error. Instead of viewing SAM as minimizing sharpness to improve generalization, our paper considers a new perspective based on SAM's training dynamics. We propose that perturbations in SAM perform perturbed forgetting, where they discard undesirable model biases to exhibit learning signals that generalize better. We relate our notion of forgetting to the information bottleneck principle, use it to explain observations like the better generalization of smaller perturbation batches, and show that perturbed forgetting can exhibit a stronger correlation with generalization than flatness. While standard SAM targets model biases exposed by the steepest ascent directions, we propose a new perturbation that targets biases exposed through the model's outputs. Our output bias forgetting perturbations outperform standard SAM, GSAM, and ASAM on ImageNet, robustness benchmarks, and transfer to CIFAR-{10,100}, while sometimes converging to sharper regions. Our results suggest that the benefits of SAM can be explained by alternative mechanistic principles that do not require flatness of the loss surface.
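
As context for the perturbation being modified here, below is a minimal sketch of one standard SAM step, assuming PyTorch; the function name sam_step and the default rho are illustrative, not taken from the paper. The output bias forgetting perturbation would replace the steepest-ascent direction computed in step 1 with one derived from the model's outputs, whose exact form the abstract does not give.

import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    # 1) Ascent step: perturb the weights along the normalized gradient direction.
    loss = loss_fn(model(x), y)
    loss.backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                        # w -> w + e
            perturbations.append((p, e))
    model.zero_grad()

    # 2) Descent step: take the gradient at the perturbed weights,
    #    then restore the original weights and apply the update.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)                        # w + e -> w
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()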


SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision

arXiv.org Artificial Intelligence

Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose Sparo, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using Sparo with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using Sparo, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to +14% for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (+3% each). We also showcase a powerful ability to intervene and select individual Sparo concepts to further improve downstream task performance (up from +4% to +9% for SugarCrepe) and use this ability to study the robustness of Sparo's representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.
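
One way to picture the read-out described above is sketched below, assuming PyTorch: a learned query per slot attends over the backbone's token sequence, each slot is produced by a single attention head, and the slots are concatenated into the final encoding. The module name SparoReadout, the dimensions, and the projection layout are assumptions for illustration, not the authors' reference implementation.

import torch
import torch.nn as nn

class SparoReadout(nn.Module):
    def __init__(self, d_model=768, n_slots=16, d_slot=48):
        super().__init__()
        # One learned query per slot; each slot comes from a single attention head.
        self.queries = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.key_proj = nn.Linear(d_model, d_model)
        self.val_proj = nn.Linear(d_model, n_slots * d_slot)
        self.n_slots, self.d_slot = n_slots, d_slot

    def forward(self, tokens):                           # tokens: (B, T, d_model)
        B, T, _ = tokens.shape
        k = self.key_proj(tokens)                        # (B, T, d_model)
        v = self.val_proj(tokens).view(B, T, self.n_slots, self.d_slot)
        # Per-slot attention weights over the token sequence.
        attn = torch.einsum("sd,btd->bst", self.queries, k) / k.shape[-1] ** 0.5
        attn = attn.softmax(dim=-1)                      # (B, n_slots, T)
        slots = torch.einsum("bst,btsd->bsd", attn, v)   # (B, n_slots, d_slot)
        return slots.reshape(B, -1)                      # concatenated encoding

enc = SparoReadout()(torch.randn(2, 197, 768))           # e.g. ViT tokens -> (2, 768)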


On the Compositional Generalization Gap of In-Context Learning

arXiv.org Artificial Intelligence

Pretrained large generative language models have shown great performance on many tasks, but exhibit low compositional generalization abilities. Scaling such models has been shown to improve their performance on various NLP tasks, even when they are only conditioned on a few examples of the task without any fine-tuning (also known as in-context learning). In this work, we look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning. In the ID settings, the demonstrations are from the same split (test or train) that the model is being evaluated on, and in the OOD settings, they are from the other split. We look at how the relative generalization gap of in-context learning evolves as models are scaled up. We evaluate four model families (OPT, BLOOM, CodeGen, and Codex) on three semantic parsing datasets (CFQ, SCAN, and GeoQuery) with different numbers of exemplars, and observe a trend of decreasing relative generalization gap as models are scaled up.
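
The abstract does not spell out how the relative generalization gap is computed; one plausible reading, used purely for illustration, is the fraction of ID performance lost when the demonstrations come from the other split.

def relative_generalization_gap(acc_id: float, acc_ood: float) -> float:
    """Fraction of in-distribution accuracy lost under OOD demonstrations
    (an assumed definition for illustration, not necessarily the paper's)."""
    return (acc_id - acc_ood) / acc_id

print(relative_generalization_gap(0.80, 0.20))  # 0.75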


Fortuitous Forgetting in Connectionist Networks

arXiv.org Artificial Intelligence

Forgetting is often seen as an unwanted characteristic in both human and machine learning. However, we propose that forgetting can in fact be favorable to learning. We introduce forget-and-relearn as a powerful paradigm for shaping the learning trajectories of artificial neural networks. In this process, the forgetting step selectively removes undesirable information from the model, and the relearning step reinforces features that are consistently useful under different conditions. The forget-and-relearn framework unifies many existing iterative training algorithms in the image classification and language emergence literature, and allows us to understand the success of these algorithms in terms of the disproportionate forgetting of undesirable information. We leverage this understanding to improve upon existing algorithms by designing more targeted forgetting operations. Insights from our analysis provide a coherent view on the dynamics of iterative training in neural networks and offer a clear path towards performance improvements.

Forgetting is an inescapable component of human memory. It occurs naturally as neural synapses get removed or altered over time (Wang et al., 2020), and is often thought to be an undesirable characteristic of the human mind. A well-known example is the "spacing effect", which refers to the observation that long-term recall is enhanced by spacing, rather than massing, repeated study sessions. Bjork & Allen (1970) demonstrated that the key to the spacing effect is the decreased accessibility of information in-between sessions. In this work, we study a general learning paradigm that we refer to as forget-and-relearn, and show that forgetting can also benefit learning in artificial neural networks. To generalize to unseen data, we want our models to capture generalizable concepts rather than purely statistical regularities, but these desirable solutions are a small subset of the solution space and often more difficult to learn naturally (Geirhos et al., 2020). Recently, a number of training algorithms have been proposed to improve generalization by iteratively refining the learned solution. Knowledge evolution (Taha et al., 2021) improves generalization by iteratively reinitializing one part of the network while continuously training the other. Iterative magnitude pruning (Frankle & Carbin, 2019; Frankle et al., 2019) removes weights through an iterative pruning-retraining process, and outperforms unpruned models in certain settings. Hoang et al. (2018) iteratively utilize a synthetic machine translation corpus generated through back-translation of monolingual data.
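
A schematic of the forget-and-relearn loop, assuming PyTorch, is sketched below. The concrete forgetting operation shown (reinitializing the final layers, in the spirit of the knowledge-evolution example mentioned above) is only one possible instantiation, and the function names are illustrative rather than the paper's exact algorithm.

import torch.nn as nn

def forget(model: nn.Sequential, n_final_layers: int = 1):
    """Forgetting step: selectively discard information by reinitializing
    the last n layers while keeping earlier layers intact."""
    for layer in list(model)[-n_final_layers:]:
        if hasattr(layer, "reset_parameters"):
            layer.reset_parameters()

def forget_and_relearn(model, train_fn, n_cycles=5):
    for _ in range(n_cycles):
        train_fn(model)   # relearn: reinforce features that remain consistently useful
        forget(model)     # forget: remove information that is not reinforced
    train_fn(model)       # final relearning phase
    return model

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
# forget_and_relearn(model, train_fn=my_training_loop)   # my_training_loop is user-supplied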


GEAR: Geometry-Aware Rényi Information

arXiv.org Machine Learning

Shannon's seminal theory of information has been of paramount importance in the development of modern machine learning techniques. However, standard information measures deal with probability distributions over an alphabet considered as a mere set of symbols and disregard further geometric structure, which might be available in the form of a metric or similarity function. We advocate the use of a notion of entropy that reflects not only the relative abundances of symbols but also the similarities between them, which was originally introduced in theoretical ecology to study the diversity of biological communities. Echoing this idea, we propose a criterion for comparing two probability distributions (possibly degenerate and with non-overlapping supports) that takes into account the geometry of the space in which the distributions are defined. Our proposal exhibits performance on par with state-of-the-art methods based on entropy-regularized optimal transport, but enjoys a closed-form expression and thus a lower computational cost. We demonstrate the versatility of our proposal via experiments on a broad range of domains: computing image barycenters, approximating densities with a collection of (super-) samples; summarizing texts; assessing mode coverage; as well as training generative models.
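
The ecology-derived entropy alluded to above is the similarity-sensitive (Leinster-Cobbold) diversity, which depends on the smoothed distribution Zp rather than on p alone; a small NumPy sketch of that classical quantity is given below. Anything beyond this textbook formula, in particular how GEAR turns it into a criterion for comparing two distributions, is not shown here and should not be inferred from the sketch.

import numpy as np

def similarity_sensitive_diversity(p, Z, q=2.0):
    """Order-q diversity of distribution p under similarity kernel Z.
    With Z = I this reduces to the ordinary Renyi/Hill diversity."""
    zp = Z @ p                                   # expected similarity of each symbol to the rest
    if np.isclose(q, 1.0):
        return np.exp(-np.sum(p * np.log(zp)))   # q -> 1 limit
    return np.sum(p * zp ** (q - 1)) ** (1.0 / (1.0 - q))

p = np.array([0.5, 0.3, 0.2])
Z = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])                  # first two symbols are near-duplicates
print(similarity_sensitive_diversity(p, Z, q=2.0))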


Grounded Recurrent Neural Networks

arXiv.org Machine Learning

In this work, we present the Grounded Recurrent Neural Network (GRNN), a recurrent neural network architecture for multi-label prediction which explicitly ties labels to specific dimensions of the recurrent hidden state (we call this process "grounding"). The approach is particularly well-suited for extracting large numbers of concepts from text. We apply the new model to an important problem in healthcare: understanding which medical concepts are discussed in clinical text. Using a publicly available dataset derived from Intensive Care Units, we learn to label a patient's diagnoses and procedures from their discharge summary. Our evaluation shows a clear advantage to using our proposed architecture over a variety of strong baselines.
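
A minimal sketch of the grounding idea as described, assuming PyTorch: the recurrent hidden state has one dimension per label, and each label's prediction is read directly from its grounded dimension pooled over time. The class name GroundedGRU and all architectural details beyond this tying are assumptions for illustration, not the paper's exact model.

import torch
import torch.nn as nn

class GroundedGRU(nn.Module):
    def __init__(self, vocab_size, embed_dim, n_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Hidden size equals the number of labels: dimension j is "grounded" to label j.
        self.rnn = nn.GRU(embed_dim, n_labels, batch_first=True)

    def forward(self, token_ids):                  # (B, T) integer token ids
        h, _ = self.rnn(self.embed(token_ids))     # (B, T, n_labels)
        # Max-pool each grounded dimension over time, then squash to a probability.
        return torch.sigmoid(h.max(dim=1).values)  # (B, n_labels) label probabilities

probs = GroundedGRU(vocab_size=10000, embed_dim=128, n_labels=50)(
    torch.randint(0, 10000, (2, 300)))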