AITopics | Lewis, Mike

Collaborating Authors

Lewis, Mike

Compute Optimal Scaling of Skills: Knowledge vs Reasoning

Roberts, Nicholas, Chatterji, Niladri, Narang, Sharan, Lewis, Mike, Hupkes, Dieuwke

arXiv.org Artificial Intelligence, Mar-13-2025

Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as 'compute-optimally' trading-off parameter count and dataset size, alongside a more recent growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation, and we answer this question in the affirmative: scaling laws are skill-dependent. Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation of different datamixes and find that, also when correcting for datamix differences, knowledge and code exhibit fundamental differences in scaling behaviour. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that a misspecified validation set can impact compute-optimal parameter count by nearly 50%, depending on its skill composition.
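As a rough illustration of what "compute-optimal" means here, the sketch below assumes the widely used parametric loss form L(N, D) = E + A/N^alpha + B/D^beta with compute C ≈ 6·N·D (Hoffmann et al., 2022), for which the optimal parameter count has a closed form. The per-skill coefficients are hypothetical placeholders, not values from the paper; they only show that different fitted exponents imply different optimal parameter counts at the same compute budget.

```python
def optimal_params(compute, A, B, alpha, beta):
    """Parameter count N minimizing A/N**alpha + B/D**beta with compute = 6*N*D."""
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    return g * (compute / 6.0) ** (beta / (alpha + beta))


def loss(n, compute, A, B, alpha, beta):
    """Reducible loss at parameter count n when the remaining compute buys tokens."""
    d = compute / (6.0 * n)  # tokens affordable at this model size
    return A / n ** alpha + B / d ** beta


# Hypothetical per-skill fits (assumed for illustration only).
SKILL_FITS = {
    "knowledge_qa": dict(A=400.0, B=400.0, alpha=0.30, beta=0.34),
    "code": dict(A=400.0, B=400.0, alpha=0.36, beta=0.30),
}

COMPUTE = 1e21  # FLOPs budget, also an assumed value
OPTIMA = {name: optimal_params(COMPUTE, **fit) for name, fit in SKILL_FITS.items()}

for name, n_opt in OPTIMA.items():
    print(f"{name}: compute-optimal parameters ~ {n_opt:.3g}")
```

With a single validation set, only one such fit is made, so its skill composition determines which of these optima the forecast lands near.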

large language model, machine learning, natural language, (20 more...)

arXiv:2503.10061

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.66)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

BTS: Harmonizing Specialized Experts into a Generalist LLM

Zhang, Qizhen, Bhargava, Prajjwal, Bi, Chloe, Cai, Chris X., Foerster, Jakob, Fu, Jeremy, Koura, Punit Singh, Silva, Ruan, Shen, Sheng, Dinan, Emily, Gururangan, Suchin, Lewis, Mike

arXiv.org Artificial Intelligence, Jan-31-2025

We present Branch-Train-Stitch (BTS), an efficient and flexible training algorithm for combining independently trained large language model (LLM) experts into a single, capable generalist model. Following Li et al., we start with a single seed language model which is branched into domain-specific (e.g., coding or math) experts with continual pretraining. BTS combines experts into a generalist model using lightweight stitch layers, which are inserted between frozen experts and the seed LLM, and trained on a small datamix of the expert domains. Stitch layers enable the seed LLM to integrate representations from any number of experts during the forward pass, allowing it to generalize to new domains, despite remaining frozen. Because BTS does not alter the constituent LLMs, BTS provides a modular and flexible approach: experts can be easily removed and new experts can be added with only a small amount of training. Compared to alternative model merging approaches, BTS yields the best generalist performance on a variety of downstream tasks, retaining the specialized capabilities of each of the experts.

artificial intelligence, large language model, natural language, (14 more...)

arXiv:2502.00075

Country:

North America > United States (0.28)
Europe (0.28)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Optimizing Pretraining Data Mixtures with LLM-Estimated Utility

Held, William, Paranjape, Bhargavi, Koura, Punit Singh, Lewis, Mike, Zhang, Frank, Mihaylov, Todor

arXiv.org Artificial Intelligence, Jan-23-2025

Large Language Models improve with increasing amounts of high-quality training data. However, leveraging larger datasets requires balancing quality, quantity, and diversity across sources. After evaluating nine baseline methods under both compute- and data-constrained scenarios, we find token-count heuristics outperform manual and learned mixes, indicating that simple approaches accounting for dataset size and diversity are surprisingly effective. Building on this insight, we propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by 200x. Compared to manual (Groeneveld et al., 2024, OLMo), heuristic (Chung et al., 2023, UniMax), and learned (Xie et al., 2024, DoReMi) data mixes, UtiliMax leads to more compute-efficient models that perform better on average across tasks.
Large Language Model (LLM) pretraining data increasingly consists of sub-corpora from many sources covering multiple domains and varying in size (Gao et al., 2020; Du et al., 2022; TogetherAI, …). Unlike traditional multi-task learning scenarios, datasets are not necessarily aligned with a specific intended use. Moreover, "intended usage" is often multi-functional as LLMs are being developed for general-purpose functionality (Eloundou et al., 2024; Qin et al., 2023). Given multiple training corpora and multiple downstream goals, how should we sample from each corpus to get the best possible model?
Prior work has explored heuristic (Rae et al., 2021; Soldaini et al., 2024) and learned (Xie et al., 2024; Albalak et al., 2023) approaches to solve this. However, there is minimal comparison between these methods using the same data and model configuration. Furthermore, it is unclear whether these approaches are robust to the impacts of epoching, which is critical as frontier models are increasingly data-constrained (Villalobos et al., 2024; Longpre et al., 2024).
Footnotes: Work completed during an internship at Meta AI. FLOPs from Llama 70B on 2.1 million tokens needed for MEDU using the FLOP equations from Hoffmann et al. (2022).

large language model, machine learning, natural language, (18 more...)

arXiv:2501.11747

Country: Europe > United Kingdom > Scotland (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Byte Latent Transformer: Patches Scale Better Than Tokens

Pagnoni, Artidoro, Pasunuru, Ram, Rodriguez, Pedro, Nguyen, John, Muller, Benjamin, Li, Margaret, Zhou, Chunting, Yu, Lili, Weston, Jason, Zettlemoyer, Luke, Ghosh, Gargi, Lewis, Mike, Holtzman, Ari, Iyer, Srinivasan

arXiv.org Artificial Intelligence, Dec-13-2024

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
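The entropy-based segmentation described above can be sketched as follows. This is a minimal reading of the abstract, not the paper's implementation: it assumes the per-position next-byte entropies are already available (in practice they would come from a small byte-level model), and it uses a single global threshold, which is an assumed boundary rule. The function name and the example entropies are hypothetical.

```python
def entropy_patches(data, entropies, threshold):
    """Split `data` into patches, opening a new patch wherever the entropy of
    predicting the byte at that position exceeds `threshold`.

    entropies[i] is assumed to be the entropy (in bits) of the model's
    predictive distribution for data[i] given data[:i].
    """
    if len(data) != len(entropies):
        raise ValueError("need exactly one entropy value per byte")
    patches = []
    start = 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:  # hard-to-predict byte: start a new patch
            patches.append(data[start:i])
            start = i
    if data:
        patches.append(data[start:])
    return patches


# Hypothetical entropies: high at word starts, low inside predictable words.
example = b"hello world"
example_entropies = [3.0, 0.5, 0.4, 0.3, 0.2, 2.5, 2.8, 0.4, 0.3, 0.2, 0.1]
print(entropy_patches(example, example_entropies, threshold=2.0))
```

Predictable stretches end up in long patches and uncertain positions open new ones, which is how compute gets concentrated where the data is harder to predict.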

byte, large language model, machine learning, (18 more...)

arXiv:2412.09871

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Liang, Weixin, Yu, Lili, Luo, Liang, Iyer, Srinivasan, Dong, Ning, Zhou, Chunting, Ghosh, Gargi, Lewis, Mike, Yih, Wen-tau, Zettlemoyer, Luke, Lin, Xi Victoria

arXiv.org Artificial Intelligence, Nov-7-2024

The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).

large language model, machine learning, natural language, (17 more...)

arXiv:2411.04996

Country:

North America > United States (0.27)
Europe > Spain (0.27)

Genre: Research Report > New Finding (0.92)

Industry: Information Technology (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Law of the Weakest Link: Cross Capabilities of Large Language Models

Zhong, Ming, Zhang, Aston, Wang, Xuewei, Hou, Rui, Xiong, Wenhan, Zhu, Chenguang, Chen, Zhengxing, Tan, Liang, Bi, Chloe, Lewis, Mike, Popuri, Sravya, Narang, Sharan, Kambadur, Melanie, Mahajan, Dhruv, Edunov, Sergey, Han, Jiawei, van der Maaten, Laurens

arXiv.org Artificial Intelligence, Oct-2-2024

The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that are often required for real-world tasks, which we term cross capabilities. To systematically explore this concept, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy. Building on these definitions, we introduce CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100 prompts for each individual and cross capability. To ensure reliable evaluation, we involve expert annotators to assess 4,200 model responses, gathering 8,400 human ratings with detailed explanations to serve as reference examples. Our findings reveal that, in both static evaluations and attempts to enhance specific abilities, current LLMs consistently exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component. Specifically, across 58 cross-capability scores from 17 models, 38 scores are lower than all individual capabilities, while 20 fall between strong and weak, but closer to the weaker ability. These results highlight the under-performance of LLMs in cross-capability tasks, making the identification and improvement of the weakest capabilities a critical priority for future research to optimize performance in complex, multi-dimensional scenarios.
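The comparison behind the "Law of the Weakest Link" can be sketched as a small classifier that places a cross-capability score relative to the two individual capability scores it combines. The category names, the midpoint rule for "closer to the weaker ability", and the example scores are assumptions for illustration; the abstract reports only the resulting counts (38 below both, 20 in between, 58 in total).

```python
def weakest_link_category(cross, cap_a, cap_b):
    """Place a cross-capability score relative to its two individual scores."""
    weak, strong = min(cap_a, cap_b), max(cap_a, cap_b)
    if cross < weak:
        return "below_both"
    if cross > strong:
        return "above_both"
    midpoint = (weak + strong) / 2.0
    # Ties at the midpoint are counted as closer to the stronger ability.
    if cross < midpoint:
        return "between_closer_to_weaker"
    return "between_closer_to_stronger"


# Hypothetical scores for one model on one capability pair.
print(weakest_link_category(cross=58.0, cap_a=62.0, cap_b=75.0))
print(weakest_link_category(cross=64.0, cap_a=62.0, cap_b=75.0))
```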

large language model, machine learning, natural language, (22 more...)

arXiv:2409.19951

Country:

Africa (0.92)
Europe > Austria > Vienna (0.14)
North America > United States > Illinois (0.14)
Asia > Middle East > UAE (0.14)

Genre:

Workflow (1.00)
Research Report > New Finding (1.00)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Games (1.00)
Law (0.93)
Health & Medicine > Consumer Health (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

Zhong, Zexuan, Xia, Mengzhou, Chen, Danqi, Lewis, Mike

arXiv.org Artificial Intelligence, May-5-2024

Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.
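To make "softly merges experts in the parameter space" concrete, here is a minimal sketch, assuming each expert is a single weight matrix and the router emits one logit per expert. The merged matrix is the softmax-weighted average of the expert matrices, so no discrete expert choice is made and the operation stays differentiable. Function names and shapes are hypothetical, and the causal segment routing and data batching from the abstract are not modeled here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_experts(expert_weights, router_logits):
    """Return (merged_matrix, routing_probs) for equally shaped expert matrices.

    merged[r][c] = sum_i probs[i] * expert_weights[i][r][c]
    """
    probs = softmax(router_logits)
    rows, cols = len(expert_weights[0]), len(expert_weights[0][0])
    merged = [
        [sum(p * w[r][c] for p, w in zip(probs, expert_weights)) for c in range(cols)]
        for r in range(rows)
    ]
    return merged, probs


# Two toy 2x2 experts with equal router logits.
experts = [[[1.0, 0.0], [0.0, 1.0]], [[3.0, 0.0], [0.0, 3.0]]]
merged, probs = merge_experts(experts, router_logits=[0.0, 0.0])
print(probs, merged)
```

Because the merge happens once per routing decision rather than once per token, the cost of the merge itself depends on how often routing weights are recomputed, which is the efficiency concern that segment-level routing addresses.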

artificial intelligence, machine learning, natural language, (19 more...)

arXiv:2405.03133

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.68)

Efficient Streaming Language Models with Attention Sinks

Xiao, Guangxuan, Tian, Yuandong, Chen, Beidi, Han, Song, Lewis, Mike

arXiv.org Artificial Intelligence, Dec-11-2023

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm.
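The difference between plain window attention and the attention-sink cache can be sketched as a rule for which token positions keep their KV entries. This is an illustrative sketch only: the function names and the `n_sink` and `window` parameters are hypothetical, and the actual implementation lives in the repository linked above.

```python
def window_cache_positions(seq_len, cache_size):
    """Plain window attention: keep only the most recent `cache_size` positions."""
    return list(range(max(0, seq_len - cache_size), seq_len))


def streaming_cache_positions(seq_len, n_sink, window):
    """Attention-sink cache: keep the first `n_sink` positions plus the most
    recent `window` positions, evicting everything in between."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # nothing needs to be evicted yet
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))


# Same cache budget (7 entries) for a 10-token sequence.
print(window_cache_positions(10, cache_size=7))
print(streaming_cache_positions(10, n_sink=4, window=3))
```

Both policies hold a fixed number of entries, so memory stays constant as the stream grows; the only change is that the initial positions are never evicted.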

large language model, machine learning, natural language, (21 more...)

arXiv:2309.17453

Country:

North America > United States (0.14)
Europe > Germany (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

In-Context Pretraining: Language Modeling Beyond Document Boundaries

Shi, Weijia, Min, Sewon, Lomeli, Maria, Zhou, Chunting, Li, Margaret, James, Rich, Lin, Xi Victoria, Smith, Noah A., Zettlemoyer, Luke, Yih, Scott, Lewis, Mike

arXiv.org Artificial Intelligence, Nov-30-2023

Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
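The document-ordering idea can be sketched at toy scale as a greedy walk that visits every document exactly once and always moves to the most similar unvisited document. This is a simplified stand-in, not the paper's algorithm: it uses exact brute-force search and word-overlap (Jaccard) similarity, whereas the abstract describes approximate nearest neighbor search and a graph traversal designed for billions of documents. All names and the example corpus are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two documents (toy similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def related_order(docs, start=0):
    """Greedy nearest-neighbor ordering; each document index appears once."""
    if not docs:
        return []
    order = [start]
    unvisited = [i for i in range(len(docs)) if i != start]
    while unvisited:
        current = order[-1]
        nxt = max(unvisited, key=lambda i: jaccard(docs[current], docs[i]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


DOCS = [
    "cats purr and sleep",
    "stock markets fell today",
    "cats sleep all day today",
    "markets and stock prices rose",
]
ORDER = related_order(DOCS)
print([DOCS[i] for i in ORDER])
```

Concatenating documents in this order, then cutting the stream into fixed-length contexts, places related documents in the same context without repeating any of them.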

large language model, machine learning, natural language, (17 more...)

arXiv:2310.10638

Country:

Europe (0.68)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Washington > King County > Seattle (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)

Effective Long-Context Scaling of Foundation Models

Xiong, Wenhan, Liu, Jingyu, Molybog, Igor, Zhang, Hejia, Bhargava, Prajjwal, Hou, Rui, Martin, Louis, Rungta, Rashi, Sankararaman, Karthik Abinav, Oguz, Barlas, Khabsa, Madian, Fang, Han, Mehdad, Yashar, Narang, Sharan, Malik, Kshitiz, Fan, Angela, Bhosale, Shruti, Edunov, Sergey, Lewis, Mike, Wang, Sinong, Ma, Hao

arXiv.org Artificial Intelligence, Nov-13-2023

We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis on the individual components of our method. We delve into Llama's position encodings and discuss its limitation in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths -- our ablation experiments suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.

large language model, machine learning, natural language, (19 more...)

arXiv:2309.16039

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (0.46)
Research Report > Experimental Study (0.34)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
