AITopics | Pezeshki, Mohammad

Collaborating Authors

Pezeshki, Mohammad

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Steering Large Language Model Activations in Sparse Spaces

Bayat, Reza, Rahimi-Kalahroudi, Ali, Pezeshki, Mohammad, Chandar, Sarath, Vincent, Pascal

arXiv.org Artificial IntelligenceFeb-28-2025

A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a potential solution. However, prior work in dense activation spaces struggles with superposition, wherein multiple features become entangled, limiting interpretability and precise control. In contrast, sparse representations provide an untapped opportunity for more interpretable behavior modulation. In this work, we introduce sparse activation steering (SAS), a method that leverages sparse autoencoders (SAEs) to steer LLM behavior in sparse spaces. By isolating behavior-specific features through a contrastive prompt-pairing approach, we define a set of features that can selectively reinforce or suppress behaviors. Experiments on Gemma 2 LLMs show that SAS vectors enable nuanced behavioral modulation and finer-grained control. Furthermore, scaling SAEs improves monosemanticity of SAS vectors, suggesting more reliable and interpretable interventions.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2503.00177

Country: North America > Canada > Quebec (0.14)

Genre: Research Report > New Finding (0.92)

Industry:

Media (0.68)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Askari-Hemmat, Reyhane, Pezeshki, Mohammad, Dohmatob, Elvis, Bordes, Florian, Astolfi, Pietro, Hall, Melissa, Verbeek, Jakob, Drozdzal, Michal, Romero-Soriano, Adriana

arXiv.org Artificial IntelligenceFeb-21-2025

Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.

artificial intelligence, iteration, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2502.15588

Country:

North America > United States (0.28)
North America > Canada > Quebec > Montreal (0.14)

Genre: Research Report (1.00)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification

Donhauser, Konstantin, Arnal, Charles, Pezeshki, Mohammad, Cabannes, Vivien, Lopez-Paz, David, Ahuja, Kartik

arXiv.org Artificial IntelligenceFeb-10-2025

The ability to process long contexts is crucial for many natural language processing tasks, yet it remains a significant challenge. While substantial progress has been made in enhancing the efficiency of attention mechanisms, there is still a gap in understanding how attention heads function in long-context settings. In this paper, we observe that while certain heads consistently attend to local information only, others swing between attending to local and long-context information depending on the query. This raises the question: can we identify which heads require long-context information to predict the next token accurately? We demonstrate that it's possible to predict which heads are crucial for long-context processing using only local keys. The core idea here is to exploit a simple model for the long-context scores via second moment approximations. These findings unveil simple properties of attention in the context of long sequences, and open the door to potentially significant gains in efficiency.

adaptive long-context head identification, artificial intelligence, natural language, (2 more...)

arXiv.org Artificial Intelligence

2502.09647

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.87)

Add feedback

The Pitfalls of Memorization: When Memorization Hurts Generalization

Bayat, Reza, Pezeshki, Mohammad, Dohmatob, Elvis, Lopez-Paz, David, Vincent, Pascal

arXiv.org Machine LearningDec-10-2024

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations.This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations would particularly lead to poor generalization when are combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.

artificial intelligence, machine learning, memorization, (13 more...)

arXiv.org Machine Learning

2412.07684

Country:

North America > Canada (0.28)
North America > United States (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)

Add feedback

Compositional Risk Minimization

Mahajan, Divyat, Pezeshki, Mohammad, Mitliagkas, Ioannis, Ahuja, Kartik, Vincent, Pascal

arXiv.org Artificial IntelligenceOct-8-2024

In this work, we tackle a challenging and extreme form of subpopulation shift, which is termed compositional shift. Under compositional shifts, some combinations of attributes are totally absent from the training distribution but present in the test distribution. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirms the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.

affine hull, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2410.06303

Country: North America > Canada (0.14)

Genre:

Research Report (1.00)
Overview (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Feedback-guided Data Synthesis for Imbalanced Classification

Hemmat, Reyhane Askari, Pezeshki, Mohammad, Bordes, Florian, Drozdzal, Michal, Romero-Soriano, Adriana

arXiv.org Artificial IntelligenceSep-29-2023

Current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples to improve the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. In order for the framework to be effective, we find that the samples must be close to the support of the real data of the task at hand, and be sufficiently diverse. We validate three feedback criteria on a long-tailed dataset (ImageNet-LT) as well as a group-imbalanced dataset (NICO++). On ImageNet-LT, we achieve state-of-the-art results, with over 4 percent improvement on underrepresented classes while being twice efficient in terms of the number of generated synthetic samples. NICO++ also enjoys marked boosts of over 5 percent in worst group accuracy. With these results, our framework paves the path towards effectively leveraging state-of-the-art text-to-image models as data sources that can be queried to improve downstream applications.

artificial intelligence, feedback-guided data synthesis, machine learning, (1 more...)

arXiv.org Artificial Intelligence

2310.00158

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.53)

Add feedback

Discovering environments with XRM

Pezeshki, Mohammad, Bouchacourt, Diane, Ibrahim, Mark, Ballas, Nicolas, Vincent, Pascal, Lopez-Paz, David

arXiv.org Machine LearningSep-28-2023

Successful out-of-distribution generalization requires environment annotations. Unfortunately, these are resource-intensive to obtain, and their relevance to model performance is limited by the expectations and perceptual biases of human annotators. Therefore, to enable robust AI systems across applications, we must develop algorithms to automatically discover environments inducing broad generalization. Current proposals, which divide examples based on their training error, suffer from one fundamental problem. These methods add hyper-parameters and early-stopping criteria that are impossible to tune without a validation set with human-annotated environments, the very information subject to discovery. XRM trains two twin networks, each learning from one random half of the training data, while imitating confident held-out mistakes made by its sibling. XRM provides a recipe for hyper-parameter tuning, does not require early-stopping, and can discover environments for all training and validation data. Domain generalization algorithms built on top of XRM environments achieve oracle worst-group-accuracy, solving a long-standing problem in out-of-distribution generalization. AI systems pervade our lives, spanning applications such as finance (Hand and Henley, 1997), healthcare (Jiang et al., 2017), self-driving vehicles (Bojarski et al., 2016), and justice (Angwin et al., 2016). While machines appear to outperform humans on such tasks, these systems fall apart when deployed in testing conditions different to their experienced training environments (Geirhos et al., 2020).

annotation, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

2309.16748

Country: North America (0.28)

Genre: Research Report (0.43)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.54)

Add feedback

Predicting Grokking Long Before it Happens: A look into the loss landscape of models which grok

Notsawo, Pascal Jr. Tikeng, Zhou, Hattie, Pezeshki, Mohammad, Rish, Irina, Dumas, Guillaume

arXiv.org Artificial IntelligenceSep-28-2023

This paper focuses on predicting the occurrence of grokking in neural networks, a phenomenon in which perfect generalization emerges long after signs of overfitting or memorization are observed. It has been reported that grokking can only be observed with certain hyper-parameters. This makes it critical to identify the parameters that lead to grokking. However, since grokking occurs after a large number of epochs, searching for the hyper-parameters that lead to it is time-consuming. In this paper, we propose a low-cost method to predict grokking without training for a large number of epochs. In essence, by studying the learning curve of the first few epochs, we show that one can predict whether grokking will occur later on. Specifically, if certain oscillations occur in the early epochs, one can expect grokking to occur if the model is trained for a much longer period of time. We propose using the spectral signature of a learning curve derived by applying the Fourier transform to quantify the amplitude of low-frequency components to detect the presence of such oscillations. We also present additional experiments aimed at explaining the cause of these oscillations and characterizing the loss landscape.

artificial intelligence, loss landscape, machine learning, (2 more...)

arXiv.org Artificial Intelligence

2306.13253

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Multi-scale Feature Learning Dynamics: Insights for Double Descent

Pezeshki, Mohammad, Mitra, Amartya, Bengio, Yoshua, Lajoie, Guillaume

arXiv.org Artificial IntelligenceDec-6-2021

A key challenge in building theoretical foundations for deep learning is the complex optimization dynamics of neural networks, resulting from the high-dimensional interactions between the large number of network parameters. Such non-trivial dynamics lead to intriguing behaviors such as the phenomenon of "double descent" of the generalization error. The more commonly studied aspect of this phenomenon corresponds to model-wise double descent where the test error exhibits a second descent with increasing model complexity, beyond the classical U-shaped error curve. In this work, we investigate the origins of the less studied epoch-wise double descent in which the test error undergoes two non-monotonous transitions, or descents as the training time increases. By leveraging tools from statistical physics, we study a linear teacher-student setup exhibiting epoch-wise double descent similar to that in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of generalization error over training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent in test error. We validate our findings through numerical experiments where our theory accurately predicts empirical findings and remains consistent with observations in deep neural networks.

artificial intelligence, machine learning, null, (16 more...)

arXiv.org Artificial Intelligence

2112.03215

Country:

North America > Canada (0.28)
North America > United States > California (0.14)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

Simple data balancing achieves competitive worst-group-accuracy

Idrissi, Badr Youbi, Arjovsky, Martin, Pezeshki, Mohammad, Lopez-Paz, David

arXiv.org Artificial IntelligenceOct-27-2021

We study the problem of learning classifiers that perform well across (known or unknown) groups of data. After observing that common worst-group-accuracy datasets suffer from substantial imbalances, we set out to compare state-of-the-art methods to simple balancing of classes and groups by either subsampling or reweighting data. Our results show that these data balancing baselines achieve state-of-the-art-accuracy, while being faster to train and requiring no additional hyper-parameters. In addition, we highlight that access to group information is most critical for model selection purposes, and not so much during training. All in all, our findings beg closer examination of benchmarks and methods for research in worst-group-accuracy optimization.

artificial intelligence, machine learning, test worst-group-accuracy, (15 more...)

arXiv.org Artificial Intelligence

2110.14503

Country:

Europe > France (0.15)
North America > Canada (0.14)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback