AITopics | Jaggi, Martin

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

Hägele, Alexander, Bakouch, Elie, Kosson, Atli, Allal, Loubna Ben, Von Werra, Leandro, Jaggi, Martin

arXiv.org Artificial Intelligence, May 29, 2024

Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup and future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across different lengths for the same model size. We investigate the training behavior of a direct alternative (constant learning rate and cooldowns) and find that it scales predictably and reliably, similarly to cosine. Additionally, we show that stochastic weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales. Importantly, with these findings we demonstrate that scaling experiments can be performed with significantly reduced compute and GPU hours by utilizing fewer but reusable training runs.
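As a rough illustration of the constant-learning-rate-plus-cooldown idea described in this abstract, the sketch below defines a schedule whose constant phase does not depend on the total training length. The function name, the linear cooldown shape, the warmup, and all default values are assumptions made for this example; the abstract does not specify them.

```python
def lr_constant_cooldown(step, total_steps, peak_lr=1e-3,
                         warmup_steps=100, cooldown_frac=0.2):
    """Hypothetical schedule: linear warmup, constant plateau, linear cooldown.

    The linear cooldown shape and the default values are assumptions for
    illustration only.
    """
    cooldown_steps = max(1, int(total_steps * cooldown_frac))
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        # Linear warmup from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        # Plateau: the value is independent of total_steps.
        return peak_lr
    # Linear decay to zero over the final cooldown_steps steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / cooldown_steps
```

Because the plateau value is the same for any `total_steps`, a checkpoint taken during the constant phase could in principle be cooled down at several different lengths, which is the kind of run reuse the abstract refers to.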

cooldown, large language model, machine learning

arXiv:2405.18392

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)

The Privacy Power of Correlated Noise in Decentralized Learning

Allouah, Youssef, Koloskova, Anastasia, Firdoussi, Aymane El, Jaggi, Martin, Guerraoui, Rachid

arXiv.org Machine Learning, May 3, 2024

Decentralized learning is appealing as it enables the scalable usage of large amounts of distributed data and resources (without resorting to any central entity), while promoting privacy since every user minimizes the direct exposure of their data. Yet, without additional precautions, curious users can still leverage models obtained from their peers to violate privacy. In this paper, we propose Decor, a variant of decentralized SGD with differential privacy (DP) guarantees. Essentially, in Decor, users securely exchange randomness seeds in one communication round to generate pairwise-canceling correlated Gaussian noise, which is injected to protect local models at every communication round. We theoretically and empirically show that, for arbitrary connected graphs, Decor matches the central DP optimal privacy-utility trade-off. We do so under SecLDP, our new relaxation of local DP, which protects all user communications against an external eavesdropper and curious users, assuming that every pair of connected users shares a secret, i.e., information hidden from all others. The main theoretical challenge is to control the accumulation of non-canceling correlated noise due to network sparsity. We also propose a companion SecLDP privacy accountant for public use.
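The pairwise-canceling noise mentioned in this abstract can be illustrated with a toy sketch: each connected pair derives the same Gaussian draw from a shared seed, one user adds it and the other subtracts it. This is not the Decor algorithm and carries no privacy guarantee; the function name, the way seeds are derived, and the example graph are all assumptions made for illustration.

```python
import random


def pairwise_canceling_noise(edges, n_users, dim, sigma=1.0, base_seed=0):
    """Toy sketch: per-user noise vectors that sum to zero over the network.

    Each edge (i, j) stands for a pair of users sharing a secret seed.
    Deriving that seed from base_seed and the edge index is a stand-in
    (assumption) for a secure seed exchange.
    """
    noise = [[0.0] * dim for _ in range(n_users)]
    for idx, (i, j) in enumerate(edges):
        shared_rng = random.Random(base_seed * 1_000_003 + idx)
        for k in range(dim):
            z = shared_rng.gauss(0.0, sigma)
            noise[i][k] += z  # user i adds the shared draw
            noise[j][k] -= z  # user j subtracts the same draw
    return noise


# Hypothetical sparse graph on four users.
example_edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
example_noise = pairwise_canceling_noise(example_edges, n_users=4, dim=3)
```

Summed over all users the noise vanishes, but in a sparse graph the terms seen within any one neighborhood do not fully cancel, which relates to the accumulation issue the abstract names as its main theoretical challenge.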

artificial intelligence, correlated noise, machine learning

arXiv:2405.01031

Country: Europe > Austria > Vienna (0.14)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Personalized Collaborative Fine-Tuning for On-Device Large Language Models

Wagner, Nicolas, Fan, Dongyang, Jaggi, Martin

arXiv.org Artificial Intelligence, Apr 15, 2024

We explore on-device self-supervised collaborative fine-tuning of large language models with limited local data availability. Taking inspiration from the collaborative learning community, we introduce three distinct trust-weighted gradient aggregation schemes: weight similarity-based, prediction similarity-based and validation performance-based. To minimize communication overhead, we integrate Low-Rank Adaptation (LoRA) and only exchange LoRA weight updates. Our protocols, driven by prediction and performance metrics, surpass both FedAvg and local fine-tuning methods, which is particularly evident in realistic scenarios with more diverse local data distributions. The results underscore the effectiveness of our approach in addressing heterogeneity and scarcity within local datasets.
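A minimal sketch of trust-weighted aggregation in the spirit of the validation-performance-based scheme named in this abstract. Turning validation losses into trust weights with a softmax, the temperature parameter, and the flat-list representation of LoRA updates are all assumptions for illustration; the abstract does not give the actual formulas.

```python
import math


def trust_weights(val_losses, temperature=1.0):
    """Map each peer's validation loss to a trust weight (assumed softmax form).

    Lower loss gives higher trust; the weights sum to one.
    """
    best = min(val_losses)
    scores = [math.exp(-(loss - best) / temperature) for loss in val_losses]
    total = sum(scores)
    return [s / total for s in scores]


def aggregate(updates, weights):
    """Weighted sum of peers' (flattened) LoRA weight updates."""
    dim = len(updates[0])
    return [sum(w * u[k] for w, u in zip(weights, updates)) for k in range(dim)]
```

With equal validation losses this reduces to a plain average of the updates; as one peer's loss grows, its contribution shrinks.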

large language model, machine learning, natural language

arXiv:2404.09753

Country:

Asia (0.28)
Europe (0.28)

Genre: Research Report > New Finding (0.88)

Industry:

Information Technology (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

Ashkboos, Saleh, Mohtashami, Amirkeivan, Croci, Maximilian L., Li, Bo, Jaggi, Martin, Alistarh, Dan, Hoefler, Torsten, Hensman, James

arXiv.org Artificial Intelligence, Mar 30, 2024

We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden state without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to the activations of the feed-forward components, aspects of the attention mechanism and to the KV cache. The result is a quantized model where all matrix multiplications are performed in 4 bits, without any channels identified for retention in higher precision. Our quantized LLaMa2-70B model loses at most 0.29 WikiText-2 perplexity and retains 99% of the zero-shot performance. Code is available at: https://github.com/spcl/QuaRot.
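The computational invariance described in this abstract rests on a basic fact: for an orthogonal matrix Q, (xQ)(QᵀW) = xW. The toy sketch below checks this on a 2-dimensional example and shows the rotation spreading an outlier across coordinates. The particular matrix (a normalized 2×2 Hadamard matrix), the values of `x` and `W`, and the helper names are assumptions chosen for illustration, not taken from the paper or its code.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


s = 1 / math.sqrt(2)
Q = [[s, s], [s, -s]]           # orthogonal: Q times its transpose is the identity
x = [[10.0, 0.1]]               # hidden state with one outlier coordinate
W = [[0.5, -1.0], [2.0, 0.25]]  # hypothetical weight matrix

y_ref = matmul(x, W)                # original computation
x_rot = matmul(x, Q)                # rotated hidden state
W_rot = matmul(transpose(Q), W)     # rotation folded into the weights
y_rot = matmul(x_rot, W_rot)        # same output up to rounding
```

Here the largest entry of `x_rot` is about 7.14 instead of 10, while `y_rot` matches `y_ref`; a flatter activation range is what makes low-bit quantization easier.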

large language model, machine learning, quantization

arXiv:2404.00456

Country: Europe (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Towards an empirical understanding of MoE design choices

Fan, Dongyang, Messmer, Bettina, Jaggi, Martin

arXiv.org Artificial Intelligence, Feb 20, 2024

Mixture of Experts (MoE) models have been receiving unprecedented attention in the LLM era. While the approach was initially proposed by Jacobs et al. (1991) to encourage expert specialization when the model is under-parameterized to fit the whole data domain, contemporary practices (Fedus et al., 2022; Shazeer et al., 2017) do not specifically seek expert specialization; instead, they use MoE as a tool to scale up model expressiveness at a reduced inference cost. A study by Zoph et al. (2022a) revealed the existence of expert specialization in encoder blocks, particularly at a lexicon level. Furthermore, the recent Mistral paper by Jiang et al. (2024) provides evidence that the router exhibits structured syntactic behavior rather than topic-level understanding. We posit that the cultivation of fine-grained expert specialization is facilitated by token-level routing mechanisms.

large language model, machine learning, natural language

arXiv:2402.13089

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Makkuva, Ashok Vardhan, Bondaschi, Marco, Girish, Adway, Nagle, Alliot, Jaggi, Martin, Kim, Hyeji, Gastpar, Michael

arXiv.org Artificial Intelligence, Feb 6, 2024

In recent years, attention-based transformers have achieved tremendous success across a variety of disciplines including natural languages. A key ingredient behind their success is the generative pretraining procedure, during which these models are trained on a large text corpus in an auto-regressive manner. To shed light on this phenomenon, we propose a new framework that allows both theory and systematic experiments to study the sequential modeling capabilities of transformers through the lens of Markov chains. Inspired by the Markovianity of natural languages, we model the data as a Markovian source and utilize this framework to systematically study the interplay between the data-distributional properties, the transformer architecture, the learnt distribution, and the final model performance. In particular, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima and bad local minima contingent upon the specific data characteristics and the transformer architecture. Backed by experiments, we demonstrate that our theoretical findings are in congruence with the empirical results. We further investigate these findings in the broader context of higher order Markov chains and deeper architectures, and outline open problems in this arena. Code is available at https://github.com/Bond1995/Markov.
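To make "modeling the data as a Markovian source" concrete, here is a small sketch that samples a binary first-order Markov chain and computes its stationary distribution. The binary alphabet, the switching probabilities `p` (0 to 1) and `q` (1 to 0), and the function names are assumptions for illustration; they are not taken from the linked repository.

```python
import random


def sample_binary_markov(n, p, q, seed=0):
    """Sample n symbols from a two-state chain.

    p is the probability of moving 0 -> 1 and q of moving 1 -> 0
    (hypothetical parameter names).
    """
    rng = random.Random(seed)
    state = 0
    seq = []
    for _ in range(n):
        seq.append(state)
        flip_prob = p if state == 0 else q
        if rng.random() < flip_prob:
            state = 1 - state
    return seq


def stationary_prob_of_one(p, q):
    """Long-run fraction of ones for the chain above."""
    return p / (p + q)
```

A model that ignores the previous symbol can at best predict this stationary marginal, whereas one that conditions on the previous symbol can recover the transition probabilities `p` and `q`; comparing the two gives a simple reference point for what a trained sequence model has learnt.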

large language model, machine learning, natural language

arXiv:2402.04161

Country:

North America > United States (0.14)
Europe > Switzerland (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

LASER: Linear Compression in Wireless Distributed Optimization

Makkuva, Ashok Vardhan, Bondaschi, Marco, Vogels, Thijs, Jaggi, Martin, Kim, Hyeji, Gastpar, Michael C.

arXiv.org Artificial Intelligence, Feb 6, 2024

Data-parallel SGD is the de facto algorithm for distributed optimization, especially for large scale machine learning. Despite its merits, the communication bottleneck is one of its persistent issues. Most compression schemes to alleviate this either assume noiseless communication links, or fail to achieve good performance on practical tasks. In this paper, we close this gap and introduce LASER: LineAr CompreSsion in WirEless DistRibuted Optimization. LASER capitalizes on the inherent low-rank structure of gradients and transmits them efficiently over noisy channels. Whilst enjoying theoretical guarantees similar to those of classical SGD, LASER shows consistent gains over baselines on a variety of practical benchmarks. In particular, it outperforms state-of-the-art compression schemes on challenging computer vision and GPT language modeling tasks. On the latter, we obtain a 50-64% improvement in perplexity over our baselines for noisy channels.

artificial intelligence, machine learning, natural language

arXiv:2310.13033

Country:

Asia > Thailand (0.14)
North America > United States (0.14)
Europe > Switzerland (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

DoGE: Domain Reweighting with Generalization Estimation

Fan, Simin, Pagliardini, Matteo, Jaggi, Martin

arXiv.org Artificial Intelligence, Feb 5, 2024

The coverage and composition of the pretraining data significantly impact the generalization ability of Large Language Models (LLMs). Despite its importance, recent LLMs still rely on heuristics and trial and error to increase or reduce the influence of data domains. We propose DOmain reweighting with Generalization Estimation (DoGE), which optimizes the probability of sampling from each domain (domain weights) in a principled way. Our approach is a two-stage process consisting of (i) training a proxy model to obtain domain weights using a bi-level optimization algorithm; (ii) training a larger base model by sampling training domains according to the learned domain weights. In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture. On the SlimPajama dataset, our base model gets better perplexity and few-shot reasoning accuracies across 6 tasks compared to baseline methods. Moreover, when aiming to generalize to out-of-domain target tasks that are unseen in the pretraining corpus (OOD domain), DoGE can effectively identify inter-domain dependencies, and consistently achieves better test perplexity on the target domain.
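Stage (ii) of the process described in this abstract, sampling training domains according to learned domain weights, can be sketched as below. The domain names and weight values are made up for illustration, and the bi-level optimization of stage (i) that would produce the weights is not shown.

```python
import random


def sample_domains(domain_weights, n_samples, seed=0):
    """Draw a domain label for each training example.

    domain_weights maps a domain name to its sampling probability;
    the names and values used below are hypothetical.
    """
    rng = random.Random(seed)
    names = list(domain_weights)
    weights = [domain_weights[name] for name in names]
    return rng.choices(names, weights=weights, k=n_samples)


# Hypothetical weights; real ones would come from the proxy-model stage.
example_weights = {"web": 0.7, "code": 0.3, "books": 0.0}
example_draws = sample_domains(example_weights, n_samples=10_000)
```

Each entry of `example_draws` says which domain the next training example should be taken from, so the base model sees the domains in roughly the learned proportions.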

domain weight, large language model, natural language

arXiv:2310.15393

Country:

North America > Canada (0.14)
Europe > France (0.14)
Europe > Denmark (0.14)

Genre: Research Report > New Finding (0.48)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)

InterpretCC: Conditional Computation for Inherently Interpretable Neural Networks

Swamy, Vinitra, Blackwell, Julian, Frej, Jibril, Jaggi, Martin, Käser, Tanja

arXiv.org Artificial Intelligence, Feb 5, 2024

Real-world interpretability for neural networks is a tradeoff between three concerns: 1) it requires humans to trust the explanation approximation (e.g. post-hoc approaches), 2) it compromises the understandability of the explanation (e.g. automatically identified feature masks), and 3) it compromises the model performance (e.g. decision trees). These shortcomings are unacceptable for human-facing domains, like education, healthcare, or natural language, which require trustworthy explanations, actionable interpretations, and accurate predictions. In this work, we present InterpretCC (interpretable conditional computation), a family of interpretable-by-design neural networks that guarantee human-centric interpretability while maintaining comparable performance to state-of-the-art models by adaptively and sparsely activating features before prediction. We extend this idea into an interpretable mixture-of-experts model that allows humans to specify topics of interest, discretely separates the feature space for each data point into topical subnetworks, and adaptively and sparsely activates these topical subnetworks. We demonstrate variations of the InterpretCC architecture for text and tabular data across several real-world benchmarks: six online education courses, news classification, breast cancer diagnosis, and review sentiment.

artificial intelligence, machine learning, prediction

arXiv:2402.02933

Country:

Europe (0.46)
North America > United States (0.28)

Genre:

Research Report (0.84)
Instructional Material > Online (0.48)
Instructional Material > Course Syllabus & Notes (0.48)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Education > Educational Setting > Online (1.00)
Education > Educational Technology > Educational Software > Computer Based Training (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

Pagliardini, Matteo, Mohtashami, Amirkeivan, Fleuret, Francois, Jaggi, Martin

arXiv.org Artificial Intelligence, Feb 4, 2024

The transformer architecture from Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size, adding a few thousand parameters for large-scale models in the 100B-parameter range. Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of current and past representations; we refer to this operation as Depth-Weighted-Average (DWA). The learned DWA weights exhibit coherent patterns of information flow, revealing the strong and structured reuse of activations from distant layers. Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity as much deeper transformer models, and that for the same perplexity, these new models outperform transformer baselines in terms of memory efficiency and inference time.
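A minimal sketch of the Depth-Weighted-Average step described in this abstract: after each block, the output is replaced by a weighted combination of the embedded input and all block outputs so far. Representing activations as flat lists, passing blocks as plain functions, storing past representations before averaging, and the layout of `alpha` are all assumptions made for illustration.

```python
def dwa_forward(x0, blocks, alpha):
    """Run blocks with a weighted average over current and past representations.

    alpha[i] holds i + 2 weights: one for x0 and one per block output up to
    block i (assumed layout). Past representations are stored before
    averaging, which is also an assumption.
    """
    history = [x0]
    y = x0
    for i, block in enumerate(blocks):
        history.append(block(y))  # current block output
        weights = alpha[i]
        y = [sum(weights[j] * h[k] for j, h in enumerate(history))
             for k in range(len(x0))]
    return y


# Two hypothetical "blocks" acting on a 2-dimensional activation.
toy_blocks = [lambda v: [a + 1.0 for a in v], lambda v: [2.0 * a for a in v]]

# Putting all weight on the current output recovers a plain stack of blocks.
plain = dwa_forward([1.0, 2.0], toy_blocks, alpha=[[0.0, 1.0], [0.0, 0.0, 1.0]])

# Mixing the input back in after the first block changes the result.
mixed = dwa_forward([1.0, 2.0], toy_blocks, alpha=[[0.5, 0.5], [0.0, 0.0, 1.0]])
```

The extra parameters are just the entries of `alpha`, on the order of the squared number of blocks, which is consistent with the abstract's remark that the modification adds very few parameters.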

large language model, machine learning, natural language

arXiv:2402.02622

Country: North America > United States (0.68)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
