Lopez-Paz, David
A Differentiable Rank-Based Objective For Better Feature Learning
Pavasovic, Krunoslav Lehman, Lopez-Paz, David, Biroli, Giulio, Sagun, Levent
In this paper, we leverage existing statistical methods to better understand feature learning from data. We tackle this by modifying the model-free variable selection method Feature Ordering by Conditional Independence (FOCI), introduced in \cite{azadkia2021simple}. While FOCI is based on a non-parametric coefficient of conditional dependence, we introduce a parametric, differentiable approximation of it. With this approximate coefficient of correlation, we present a new algorithm called difFOCI, which is applicable to a wider range of machine learning problems thanks to its differentiable nature and learnable parameters. We present difFOCI in three contexts: (1) as a variable selection method with baseline comparisons to FOCI, (2) as a trainable model parametrized by a neural network, and (3) as a generic, widely applicable neural network regularizer that improves feature learning through better management of spurious correlations. We evaluate difFOCI on increasingly complex problems, ranging from basic variable selection in toy examples to saliency map comparisons in convolutional networks. We then show how difFOCI can be incorporated in the context of fairness to facilitate classification without relying on sensitive data.
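As a rough illustration of the kind of relaxation involved (not the paper's exact difFOCI objective, which targets conditional dependence and carries learnable parameters), here is a minimal PyTorch sketch that replaces the hard ranks in Chatterjee's correlation coefficient with sigmoid-based soft ranks so the statistic becomes differentiable; the function names and the temperature tau are our own assumptions.

```python
import torch

def soft_rank(y, tau=0.01):
    # Differentiable surrogate for ranks: rank_i ~ sum_j sigmoid((y_i - y_j) / tau).
    diff = y.unsqueeze(1) - y.unsqueeze(0)          # (n, n) pairwise differences
    return torch.sigmoid(diff / tau).sum(dim=1)     # (n,) soft ranks

def soft_chatterjee_xi(x, y, tau=0.01):
    # Soft version of Chatterjee's rank correlation: order the samples by x
    # (the ordering itself is treated as fixed), then measure how smoothly
    # the soft ranks of y vary along that order.  Illustrative only.
    order = torch.argsort(x)
    r = soft_rank(y, tau)[order]
    n = y.shape[0]
    return 1.0 - 3.0 * (r[1:] - r[:-1]).abs().sum() / (n ** 2 - 1)
```

Because the result is a differentiable scalar, it can be placed inside a loss and optimized by gradient descent, which is the property a differentiable rank-based objective exploits.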
Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification
Donhauser, Konstantin, Arnal, Charles, Pezeshki, Mohammad, Cabannes, Vivien, Lopez-Paz, David, Ahuja, Kartik
The ability to process long contexts is crucial for many natural language processing tasks, yet it remains a significant challenge. While substantial progress has been made in enhancing the efficiency of attention mechanisms, there is still a gap in understanding how attention heads function in long-context settings. In this paper, we observe that while certain heads consistently attend to local information only, others swing between attending to local and long-context information depending on the query. This raises the question: can we identify which heads require long-context information to predict the next token accurately? We demonstrate that it's possible to predict which heads are crucial for long-context processing using only local keys. The core idea here is to exploit a simple model for the long-context scores via second moment approximations. These findings unveil simple properties of attention in the context of long sequences, and open the door to potentially significant gains in efficiency.
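As a small, self-contained illustration of the kind of per-head diagnostic this line of work builds on (our own simplification; the paper's predictor relies on second-moment approximations of the long-context scores, not on this ratio), the sketch below measures how much of each head's attention mass lands on a local window of keys:

```python
import torch

def local_attention_mass(q, k, window=128):
    # q: (heads, d) query at the current position; k: (heads, T, d) keys.
    # Returns, per head, the fraction of attention mass on the last `window`
    # keys -- a diagnostic for spotting locally-attending heads, not the
    # paper's second-moment-based predictor.
    scores = torch.einsum("hd,htd->ht", q, k) / k.shape[-1] ** 0.5
    attn = torch.softmax(scores, dim=-1)
    return attn[:, -window:].sum(dim=-1)
```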
The Pitfalls of Memorization: When Memorization Hurts Generalization
Bayat, Reza, Pezeshki, Mohammad, Dohmatob, Elvis, Lopez-Paz, David, Vincent, Pascal
Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations lead to particularly poor generalization when they are combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.
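The mechanism described above resembles logit adjustment with per-example held-out predictions. A minimal sketch, assuming an additive shift scaled by a hypothetical hyper-parameter alpha (the sign and scaling are illustrative, not the paper's exact MAT loss):

```python
import torch
import torch.nn.functional as F

def mat_style_loss(logits, heldout_probs, targets, alpha=1.0):
    # Shift the model's logits by the log-probabilities of a predictor that
    # never trained on these examples (the held-out signal), in the spirit of
    # logit adjustment.  The additive form and the scale alpha are illustrative
    # assumptions, not the paper's exact MAT objective.
    shifted = logits + alpha * torch.log(heldout_probs.clamp_min(1e-8))
    return F.cross_entropy(shifted, targets)
```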
Flow Matching Guide and Code
Lipman, Yaron, Havasi, Marton, Holderrieth, Peter, Shaul, Neta, Le, Matt, Karrer, Brian, Chen, Ricky T. Q., Lopez-Paz, David, Ben-Hamu, Heli, Gat, Itai
Flow Matching (FM) is a recent framework for generative modeling that has achieved state-of-the-art performance across various domains, including image, video, audio, speech, and biological structures. This guide offers a comprehensive and self-contained review of FM, covering its mathematical foundations, design choices, and extensions. By also providing a PyTorch package featuring relevant examples (e.g., image and text generation), this work aims to serve as a resource for both novice and experienced researchers interested in understanding, applying and further developing FM.
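To give a flavor of what the simplest Flow Matching recipe looks like in code (written independently of the released PyTorch package's actual API), here is a minimal training step for the linear probability path, where a small network regresses the conditional velocity x1 - x0 at a random time t; the tiny MLP and hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn

# Minimal conditional flow matching step for 2-D data with a linear path.
velocity_net = nn.Sequential(nn.Linear(3, 128), nn.SiLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def flow_matching_step(x1):               # x1: (batch, 2) data samples
    x0 = torch.randn_like(x1)             # source noise
    t = torch.rand(x1.shape[0], 1)        # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1            # point on the linear path
    target = x1 - x0                      # conditional velocity of that path
    pred = velocity_net(torch.cat([xt, t], dim=1))
    loss = ((pred - target) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

Sampling then amounts to integrating the learned velocity field from noise at t=0 to data at t=1.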
Better & Faster Large Language Models via Multi-token Prediction
Gloeckle, Fabian, Idrissi, Badr Youbi, Rozière, Baptiste, Lopez-Paz, David, Synnaeve, Gabriel
Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B-parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
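A toy sketch of the multi-token objective described above, with n independent linear heads on top of a shared trunk; the embedding trunk here is only a placeholder standing in for a transformer, and all module names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenLM(nn.Module):
    # Shared trunk with n_future independent heads; head i predicts token t+i.
    def __init__(self, vocab_size, dim, n_future=4):
        super().__init__()
        self.trunk = nn.Embedding(vocab_size, dim)   # stand-in for a transformer
        self.heads = nn.ModuleList(
            [nn.Linear(dim, vocab_size) for _ in range(n_future)])

    def loss(self, tokens):                          # tokens: (batch, seq)
        h = self.trunk(tokens)                       # (batch, seq, dim)
        total = 0.0
        for i, head in enumerate(self.heads, start=1):
            logits = head(h[:, :-i])                 # predict the token i steps ahead
            targets = tokens[:, i:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.shape[-1]), targets.reshape(-1))
        return total / len(self.heads)
```

At inference time, the extra heads can be dropped (recovering a standard next-token model) or used for speculative decoding, which is where the reported speed-ups come from.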
Unified Uncertainty Calibration
Chaudhuri, Kamalika, Lopez-Paz, David
To build robust, fair, and safe AI systems, we would like our classifiers to say ``I don't know'' when facing test examples that are difficult or fall outside of the training classes. The ubiquitous strategy to predict under uncertainty is the simplistic \emph{reject-or-classify} rule: abstain from prediction if epistemic uncertainty is high, classify otherwise. Unfortunately, this recipe does not allow different sources of uncertainty to communicate with each other, produces miscalibrated predictions, and does not allow us to correct for misspecifications in our uncertainty estimates. To address these three issues, we introduce \emph{unified uncertainty calibration (U2C)}, a holistic framework to combine aleatoric and epistemic uncertainties. U2C enables a clean learning-theoretical analysis of uncertainty estimation, and outperforms reject-or-classify across a variety of ImageNet benchmarks.
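For contrast, the baseline rule criticized above is easy to state in code; this sketch shows only that reject-or-classify baseline (with -1 denoting ``I don't know''), not the U2C combiner itself.

```python
import torch

def reject_or_classify(class_probs, epistemic_score, threshold):
    # class_probs: (n, K) aleatoric class probabilities;
    # epistemic_score: (n,) epistemic uncertainty estimates.
    # Abstain (label -1) whenever epistemic uncertainty exceeds the threshold,
    # otherwise classify with the argmax -- the baseline that U2C replaces
    # with a single calibrated combination of both uncertainty sources.
    preds = class_probs.argmax(dim=1)
    preds[epistemic_score > threshold] = -1
    return preds
```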
Discovering environments with XRM
Pezeshki, Mohammad, Bouchacourt, Diane, Ibrahim, Mark, Ballas, Nicolas, Vincent, Pascal, Lopez-Paz, David
Successful out-of-distribution generalization requires environment annotations. Unfortunately, these are resource-intensive to obtain, and their relevance to model performance is limited by the expectations and perceptual biases of human annotators. Therefore, to enable robust AI systems across applications, we must develop algorithms to automatically discover environments inducing broad generalization. Current proposals, which divide examples based on their training error, suffer from one fundamental problem: they add hyper-parameters and early-stopping criteria that are impossible to tune without a validation set with human-annotated environments, the very information subject to discovery. To address this issue, we propose XRM, which trains two twin networks, each learning from one random half of the training data while imitating confident held-out mistakes made by its sibling. XRM provides a recipe for hyper-parameter tuning, does not require early stopping, and can discover environments for all training and validation data. Domain generalization algorithms built on top of XRM environments achieve oracle worst-group accuracy, solving a long-standing problem in out-of-distribution generalization.
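A simplified reading of the environment-discovery step, assuming the two twins are already trained on their respective halves (this omits the label-flipping that happens during training and is not the full XRM recipe; all names are ours):

```python
import torch

def discover_environments(net_a, net_b, x, y, in_half_a):
    # in_half_a: boolean mask, True where the example belongs to net_a's
    # training half, so its *held-out* prediction comes from net_b.
    # Examples on which the held-out twin errs form one environment,
    # the remaining examples form the other.
    with torch.no_grad():
        pred_a = net_a(x).argmax(dim=1)
        pred_b = net_b(x).argmax(dim=1)
    heldout_pred = torch.where(in_half_a, pred_b, pred_a)
    return (heldout_pred != y).long()      # environment label per example
```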
Context is Environment
Gupta, Sharut, Jegelka, Stefanie, Lopez-Paz, David, Ahuja, Kartik
One key problem in AI research is to build systems that generalize across a wide range of test environments. In principle, these algorithms should discard spurious correlations present only in certain training environments, and capture invariant patterns appearing across conditions. For example, we would like to build self-driving systems that, while trained on data from environments with varying weather conditions, traffic conditions, and driving rules, can perform satisfactorily in completely new environments. Unfortunately, this goal remains far from reach: trained models catastrophically fail to generalize to unseen weather conditions [Lechner et al., 2022]. Despite its importance, how to perform well beyond the distribution of the training data remains a burning question. In fact, entire research groups are devoted to studying generalization, major international conferences offer well-attended workshops dedicated to the issue [Wald et al., 2023], and news articles remind us of the profound societal impact of failures of ML systems [Angwin et al., 2016]. Research efforts have so far produced domain generalization algorithms that fall into one of two broad categories. On the one hand, invariance proposals [Ganin et al., 2016, Peters et al., 2016, Arjovsky et al., 2019], illustrated in Figure 1a, discard all environment-specific information, thus removing excessive signal about the problem. On the other hand, marginal transfer proposals [Blanchard et al., 2011, Li et al., 2016, Zhang et al., 2020, Bao and Karaletsos, 2023], also illustrated in Figure 1b, summarize observed inputs in each environment as a coarse embedding, diluting important signal at the example level.
Model Ratatouille: Recycling Diverse Models for Out-of-Distribution Generalization
Ramรฉ, Alexandre, Ahuja, Kartik, Zhang, Jianyu, Cord, Matthieu, Bottou, Lรฉon, Lopez-Paz, David
Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: from a pre-trained foundation model, they fine-tune the weights on the target task of interest. As a result, the Internet is awash with fine-tunings of a handful of foundation models on many diverse tasks; these individual fine-tunings exist in isolation without benefiting from each other. In our opinion, this is a missed opportunity, as these specialized models contain rich and diverse features. In this paper, we thus propose model ratatouille, a new strategy to recycle the multiple fine-tunings of the same foundation model on diverse auxiliary tasks. Specifically, we repurpose these auxiliary weights as initializations for multiple parallel fine-tunings on the target task; then, we average all fine-tuned weights to obtain the final model. This recycling strategy aims at maximizing the diversity in weights by leveraging the diversity in auxiliary tasks. Empirically, it improves the state of the art on the reference DomainBed benchmark for out-of-distribution generalization. Looking forward, this work contributes to the emerging paradigm of updatable machine learning where, akin to open-source software development, the community collaborates to reliably update machine learning models. Our code is released: https://github.com/facebookresearch/ModelRatatouille.
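The final recycling step, averaging the parallel fine-tunings, is straightforward to sketch; this is a generic uniform weight average assuming all models share one architecture, not the repository's actual utilities.

```python
import copy
import torch

def average_weights(models):
    # Uniformly average the floating-point parameters of models fine-tuned in
    # parallel from different auxiliary initializations (all sharing one
    # architecture); load the result with final_model.load_state_dict(avg).
    avg = copy.deepcopy(models[0].state_dict())
    for key, value in avg.items():
        if value.is_floating_point():
            avg[key] = torch.stack(
                [m.state_dict()[key] for m in models]).mean(dim=0)
    return avg
```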
A Closer Look at In-Context Learning under Distribution Shifts
Ahuja, Kartik, Lopez-Paz, David
In-context learning, a capability that enables a model to learn from input examples on the fly without requiring weight updates, is a defining characteristic of large language models. In this work, we follow the setting proposed in (Garg et al., 2022) to better understand the generality and limitations of in-context learning through the lens of the simple yet fundamental task of linear regression. The key question we aim to address is: are transformers more adept than some natural and simpler architectures at performing in-context learning under varying distribution shifts? As a point of comparison to transformers, we propose a simple architecture based on set-based Multi-Layer Perceptrons (MLPs). We find that both transformers and set-based MLPs exhibit in-context learning under in-distribution evaluations, but transformers more closely emulate the performance of ordinary least squares (OLS). Transformers also display better resilience to mild distribution shifts, where set-based MLPs falter. However, under severe distribution shifts, both models' in-context learning abilities diminish.
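A minimal sketch of the evaluation setup described above, with a helper of our own devising: each task draws a random linear function, builds context pairs, and records the OLS prediction that the learned in-context models are compared against (distribution shifts can be simulated by drawing the query or context from a different distribution).

```python
import torch

def linear_regression_task(n_context=20, dim=8):
    # One in-context task: context pairs (x_i, w^T x_i) plus a query point,
    # in the style of the Garg et al. (2022) setting referenced above.
    w = torch.randn(dim)
    xs = torch.randn(n_context + 1, dim)
    ys = xs @ w
    ctx_x, ctx_y, query_x = xs[:-1], ys[:-1], xs[-1]
    # OLS baseline that transformer / set-based MLP predictions are compared to.
    w_ols = torch.linalg.lstsq(ctx_x, ctx_y.unsqueeze(-1)).solution.squeeze(-1)
    return ctx_x, ctx_y, query_x, query_x @ w_ols, query_x @ w
```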