- North America > Canada > Ontario > Toronto (0.04)
- Asia > Middle East > Jordan (0.04)
Load Balancing Mixture of Experts with Similarity Preserving Routers
Omi, Nabil, Sen, Siddhartha, Farhadi, Ali
Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks by activating only a subset of parameters ("experts") for each input. A learned router computes a distribution over these experts and assigns input tokens to a small subset. However, without auxiliary balancing mechanisms, routers often converge to using only a few experts, severely limiting model capacity and degrading performance. Most current load balancing mechanisms push each token's routing distribution toward a roughly uniform distribution over experts. During training, this can cause inconsistent routing behavior, so the model spends capacity learning redundant knowledge. We address this by introducing a novel load balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training. Our experimental results show that applying our loss to the router results in 36% faster convergence and lower redundancy compared to a popular load balancing loss.
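For readers unfamiliar with the mechanism being balanced, the sketch below shows a top-k token router with the widely used Switch-style auxiliary load-balancing loss; this is the kind of "roughly uniform" objective the paper argues against, not the similarity-preserving loss it proposes (whose exact form is not given in the abstract). Shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def route_tokens(x, w_router, num_experts, k=2, balance_coef=0.01):
    """Top-k MoE routing with a standard (Switch-style) load-balancing loss.

    x:        (num_tokens, d_model) token representations
    w_router: (d_model, num_experts) router weights
    Returns expert indices, gate weights, and the auxiliary balancing loss.
    """
    logits = x @ w_router                        # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    gate_weights, expert_idx = probs.topk(k, dim=-1)

    # Balancing loss: product of the fraction of tokens whose top-1 choice is
    # each expert and the mean router probability assigned to that expert.
    dispatch = F.one_hot(expert_idx[:, 0], num_experts).float()
    frac_tokens = dispatch.mean(dim=0)           # fraction of tokens per expert
    frac_probs = probs.mean(dim=0)               # mean routing probability per expert
    balance_loss = balance_coef * num_experts * (frac_tokens * frac_probs).sum()

    return expert_idx, gate_weights, balance_loss
```

The paper's contribution replaces this uniform-pressure term with a loss that also preserves token-wise similarity structure in the routing decisions.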
- Information Technology > Communications > Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Faster MoE LLM Inference for Extremely Large Models
Yang, Haoqi, Shi, Luohe, Li, Qiwei, Li, Zuchao, Wang, Ping, Du, Bo, Shen, Mengjia, Zhao, Hai
Sparse Mixture of Experts (MoE) large language models (LLMs) are gradually becoming the mainstream approach for ultra-large-scale models. Existing optimization efforts for MoE models have focused primarily on coarse-grained MoE architectures. With the emergence of the DeepSeek models, fine-grained MoE models are gaining popularity, yet research on them remains limited. We therefore examine their efficiency dynamics under different serving loads. Additionally, fine-grained models allow deployers to reduce the number of routed experts, both activated counts and total counts, raising the question of how this reduction affects the trade-off between MoE efficiency and performance. Our findings indicate that while deploying MoE models presents greater challenges, it also offers significant optimization opportunities. Reducing the number of activated experts can lead to substantial efficiency improvements in certain scenarios, with only minor performance degradation. Reducing the total number of experts provides limited efficiency gains but results in severe performance degradation. Our method can increase throughput by at least 10% without any performance degradation. Overall, we conclude that MoE inference optimization remains an area with substantial potential for exploration and improvement.
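As a concrete picture of the "reduce activated experts" knob discussed above, a fine-grained MoE layer can be served with a smaller top-k than it was trained with. The sketch below is a minimal, generic illustration under that assumption, not DeepSeek's implementation; `router` is assumed to be a linear layer over experts and `experts` a list of feed-forward modules.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def moe_forward(x, router, experts, top_k):
    """Inference through a generic MoE layer with a configurable number of
    activated experts. Serving with a smaller top_k than used in training
    reduces per-token FLOPs and expert traffic, at some quality cost.
    x: (num_tokens, d_model)
    """
    probs = F.softmax(router(x), dim=-1)                    # (tokens, num_experts)
    weights, idx = probs.topk(top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize gates

    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(len(experts)):
            mask = idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * experts[e](x[mask])
    return out
```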
- Europe (0.93)
- Asia (0.68)
- North America > United States > Minnesota (0.28)
Selective Attention Improves Transformer
Leviathan, Yaniv, Kalman, Matan, Matias, Yossi
We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention improves language modeling performance in a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with 2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention, than those without selective attention, at the same validation perplexity.
Different tasks have different memory requirements. At one extreme, copying an arbitrary sequence requires retaining all sequence elements in memory. At the other extreme, determining whether a specific element appeared at least once requires persisting only a constant amount of memory. Transformers (Vaswani et al., 2017) keep the entire history in their context buffers, allowing them to solve tasks such as copying, while famously leading to their quadratic attention cost. RNNs (Rumelhart et al., 1986) and their modern structured state space variants (Gu et al., 2022; Gu & Dao, 2024) keep only a constant-sized sketch of the history, making inference cost linear, but rendering them incapable of solving tasks such as arbitrary string copying.
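The abstract does not spell out the mechanism, but one way to picture "reducing attention to unneeded elements" is to subtract an accumulated, non-negative selection score from the pre-softmax logits, so tokens that earlier positions mark as no longer needed are down-weighted (and could eventually be evicted from the context buffer). The sketch below illustrates only that idea and is not the paper's exact formulation; the `selection` matrix is an assumed input.

```python
import torch
import torch.nn.functional as F

def attention_with_selection(q, k, v, selection):
    """Causal attention where an accumulated selection score penalizes tokens
    deemed unneeded. Illustrative only, not the paper's exact method.

    q, k, v:   (seq, d_head)
    selection: (seq, seq) non-negative scores; selection[j, i] > 0 means token j
               marks token i as no longer needed by positions after j.
    """
    seq, d = q.shape
    logits = (q @ k.T) / d ** 0.5                              # (seq, seq)

    # penalty[t, i] = total selection mass assigned to token i by positions before t
    penalty = torch.cumsum(selection, dim=0)
    penalty = torch.cat([torch.zeros(1, seq), penalty[:-1]], dim=0)

    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    logits = logits.masked_fill(causal, float("-inf")) - penalty
    return F.softmax(logits, dim=-1) @ v
```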
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > China > Hong Kong (0.04)
- Asia > China > Guangxi Province > Nanning (0.04)
Was Linguistic A.I. Created by Accident?
In the spring of 2017, in a room on the second floor of Google's Building 1965, a college intern named Aidan Gomez stretched out, exhausted. It was three in the morning, and Gomez and Ashish Vaswani, a scientist focussed on natural language processing, were working on their team's contribution to the Neural Information Processing Systems conference, the biggest annual meeting in the field of artificial intelligence. Along with the rest of their eight-person group at Google, they had been pushing flat out for twelve weeks, sometimes sleeping in the office, on couches by a curtain that had a neuron-like pattern. They were nearing the finish line, but Gomez didn't have the energy to go out to a bar and celebrate. He couldn't have even if he'd wanted to: he was only twenty, too young to drink in the United States.
- North America > United States (0.24)
- North America > Canada > Ontario > Toronto (0.14)
- Leisure & Entertainment (0.70)
- Media (0.48)
- Information Technology > Services (0.35)
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Mesnard, Thomas, Hardin, Cassidy, Dadashi, Robert, Bhupatiraju, Surya, Pathak, Shreya, Sifre, Laurent, Rivière, Morgane, Kale, Mihir Sanjay, Love, Juliette, Tafti, Pouya, Hussenot, Léonard, Sessa, Pier Giuseppe, Chowdhery, Aakanksha, Roberts, Adam, Barua, Aditya, Botev, Alex, Castro-Ros, Alex, Slone, Ambrose, Héliou, Amélie, Tacchetti, Andrea, Bulanova, Anna, Paterson, Antonia, Tsai, Beth, Shahriari, Bobak, Lan, Charline Le, Choquette-Choo, Christopher A., Crepy, Clément, Cer, Daniel, Ippolito, Daphne, Reid, David, Buchatskaya, Elena, Ni, Eric, Noland, Eric, Yan, Geng, Tucker, George, Muraru, George-Christian, Rozhdestvenskiy, Grigory, Michalewski, Henryk, Tenney, Ian, Grishchenko, Ivan, Austin, Jacob, Keeling, James, Labanowski, Jane, Lespiau, Jean-Baptiste, Stanway, Jeff, Brennan, Jenny, Chen, Jeremy, Ferret, Johan, Chiu, Justin, Mao-Jones, Justin, Lee, Katherine, Yu, Kathy, Millican, Katie, Sjoesund, Lars Lowe, Lee, Lisa, Dixon, Lucas, Reid, Machel, Mikuła, Maciej, Wirth, Mateo, Sharman, Michael, Chinaev, Nikolai, Thain, Nithum, Bachem, Olivier, Chang, Oscar, Wahltinez, Oscar, Bailey, Paige, Michel, Paul, Yotov, Petko, Chaabouni, Rahma, Comanescu, Ramona, Jana, Reena, Anil, Rohan, McIlroy, Ross, Liu, Ruibo, Mullins, Ryan, Smith, Samuel L, Borgeaud, Sebastian, Girgin, Sertan, Douglas, Sholto, Pandya, Shree, Shakeri, Siamak, De, Soham, Klimenko, Ted, Hennigan, Tom, Feinberg, Vlad, Stokowiec, Wojciech, Chen, Yu-hui, Ahmed, Zafarali, Gong, Zhitao, Warkentin, Tris, Peran, Ludovic, Giang, Minh, Farabet, Clément, Vinyals, Oriol, Dean, Jeff, Kavukcuoglu, Koray, Hassabis, Demis, Ghahramani, Zoubin, Eck, Douglas, Barral, Joelle, Pereira, Fernando, Collins, Eli, Joulin, Armand, Fiedel, Noah, Senter, Evan, Andreev, Alek, Kenealy, Kathleen
This work introduces Gemma, a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.
- North America > United States > Arizona > Maricopa County > Scottsdale (0.04)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Differentially Private Training of Mixture of Experts Models
Tholoniat, Pierre, Inan, Huseyin A., Kulkarni, Janardhan, Sim, Robert
This position paper investigates the integration of Differential Privacy (DP) in the training of Mixture of Experts (MoE) models within the field of natural language processing. As Large Language Models (LLMs) scale to billions of parameters, leveraging expansive datasets, they exhibit enhanced linguistic capabilities and emergent abilities. However, this growth raises significant computational and privacy concerns. Our study addresses these issues by exploring the potential of MoE models, known for their computational efficiency, and the application of DP, a standard for privacy preservation. We present the first known attempt to train MoE models under the constraints of DP, addressing the unique challenges posed by their architecture and the complexities of DP integration. Our initial experimental studies demonstrate that MoE models can be effectively trained with DP, achieving performance that is competitive with their non-private counterparts. This initial study aims to provide valuable insights and ignite further research in the domain of privacy-preserving MoE models, laying the groundwork for prospective developments in this evolving field.
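The abstract does not detail the training recipe; the standard building block for DP training is DP-SGD (Abadi et al., 2016): clip each per-example gradient, average, and add Gaussian noise. The sketch below applies that generic recipe to a model's parameters (router and experts alike) and makes no claim about the paper's specific MoE-aware adaptations; `batch` is assumed to be a list of (input, target) tensor pairs.

```python
import torch

def dp_sgd_step(model, loss_fn, batch, lr=1e-3, clip_norm=1.0, noise_multiplier=1.0):
    """One generic DP-SGD step: clip each per-example gradient to clip_norm,
    sum, add Gaussian noise scaled by noise_multiplier * clip_norm, then average."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in batch:  # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in params))
        scale = (clip_norm / (norm + 1e-6)).clamp(max=1.0)
        for s, p in zip(summed, params):
            s += p.grad * scale

    n = len(batch)
    with torch.no_grad():
        for s, p in zip(summed, params):
            noise = torch.randn_like(p) * noise_multiplier * clip_norm
            p -= lr * (s + noise) / n
```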
- Asia > Middle East > Jordan (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (3 more...)
Introducing Bode: A Fine-Tuned Large Language Model for Portuguese Prompt-Based Task
Garcia, Gabriel Lino, Paiola, Pedro Henrique, Morelli, Luis Henrique, Candido, Giovani, Júnior, Arnaldo Cândido, Jodas, Danilo Samuel, Afonso, Luis C. S., Guilherme, Ivan Rizzo, Penteado, Bruno Elias, Papa, João Paulo
Large Language Models (LLMs) are increasingly bringing advances to Natural Language Processing. However, low-resource languages such as Portuguese, which are under-represented in datasets for many NLP tasks or have less substantial datasets, benefit from LLMs to a lesser extent. LLMs trained on multilingual datasets typically struggle to respond satisfactorily to prompts in Portuguese, exhibiting, for example, code-switching in their responses. This work proposes a fine-tuned LLaMA 2-based model for Portuguese prompts named Bode, in two versions: 7B and 13B. We evaluate the performance of this model in classification tasks using the zero-shot approach with in-context learning, and compare it with other LLMs. Our main contribution is an LLM that achieves satisfactory results in Portuguese and is free for research or commercial purposes.
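For readers unfamiliar with the evaluation setup, zero-shot classification with in-context learning amounts to phrasing the task as an instruction prompt and letting the model complete the label. The sketch below shows this with the Hugging Face transformers API; the model identifier is a placeholder, not the authors' published checkpoint name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/bode-7b"  # placeholder; use the checkpoint released by the authors

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Zero-shot sentiment classification phrased as a Portuguese instruction prompt.
prompt = (
    "Classify the sentiment of the following sentence as 'positive' or 'negative'.\n"
    "Sentence: O filme foi muito melhor do que eu esperava.\n"
    "Sentiment:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```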
- South America > Brazil > São Paulo (0.04)
- Asia > Middle East > Israel (0.04)
- North America > United States > New York > New York County > New York City (0.04)
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Ainslie, Joshua, Lee-Thorp, James, de Jong, Michiel, Zemlyanskiy, Yury, Lebrón, Federico, Sanghai, Sumit
Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate number of key-value heads (more than one, fewer than the number of query heads). We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.
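A minimal sketch of the grouped-query attention computation, assuming illustrative shapes: query heads are split into groups and each group shares one key-value head, so GQA interpolates between multi-head attention (as many KV heads as query heads) and multi-query attention (a single KV head).

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, d); k, v: (batch, n_kv_heads, seq, d),
    with n_q_heads divisible by n_kv_heads. Each group of
    n_q_heads // n_kv_heads query heads shares one key-value head."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads

    # Broadcast each key-value head to its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)

    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v
```

In the uptraining recipe, each shared key-value head is initialized by mean-pooling the original key and value heads in its group, after which pre-training continues for a small fraction of the original compute.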
- North America > United States > California (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (5 more...)
GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values
Javadi, Farnoosh, Ahmed, Walid, Hajimolahoseini, Habib, Ataiefard, Foozhan, Hassanpour, Mohammad, Asani, Saina, Wen, Austin, Awad, Omar Mohamed, Liu, Kangling, Liu, Yang
Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT for image classification, achieving an approximate 0.3% increase in accuracy while reducing the model size by about 4%. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy.
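To make the size trade-off concrete, the snippet below does rough parameter accounting for the attention projections when query and key-value projections are shared across heads. The figures assume a ViT-Base-like configuration (d_model = 768, 12 heads of size 64) and this is generic sharing arithmetic, not GQKVA's exact parameterization.

```python
def attn_proj_params(d_model=768, d_head=64, n_heads=12, n_q_proj=12, n_kv_proj=12):
    """Parameters in the attention projections when only n_q_proj distinct query
    projections and n_kv_proj distinct key/value projections are kept (shared
    across the n_heads attention heads). Generic accounting, not the paper's scheme."""
    qkv = d_model * d_head * (n_q_proj + 2 * n_kv_proj)   # query, key, value projections
    out = n_heads * d_head * d_model                       # output projection
    return qkv + out

print(attn_proj_params())                               # vanilla multi-head attention
print(attn_proj_params(n_kv_proj=4))                    # share key/value projections (GQA-like)
print(attn_proj_params(n_q_proj=4, n_kv_proj=4))        # share query projections as well
```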
- North America > Canada > Ontario > Toronto (0.04)
- North America > Canada > British Columbia (0.04)
- Asia > Middle East > Jordan (0.04)