Joulin, Armand
Gemma 3 Technical Report
Gemma Team, Kamath, Aishwarya, Ferret, Johan, Pathak, Shreya, Vieillard, Nino, Merhej, Ramona, Perrin, Sarah, Matejovicova, Tatiana, Ramé, Alexandre, Rivière, Morgane, Rouillard, Louis, Mesnard, Thomas, Cideron, Geoffrey, Grill, Jean-bastien, Ramos, Sabela, Yvinec, Edouard, Casbon, Michelle, Pot, Etienne, Penchev, Ivo, Liu, Gaël, Visin, Francesco, Kenealy, Kathleen, Beyer, Lucas, Zhai, Xiaohai, Tsitsulin, Anton, Busa-Fekete, Robert, Feng, Alex, Sachdeva, Noveen, Coleman, Benjamin, Gao, Yi, Mustafa, Basil, Barr, Iain, Parisotto, Emilio, Tian, David, Eyal, Matan, Cherry, Colin, Peter, Jan-Thorsten, Sinopalnikov, Danila, Bhupatiraju, Surya, Agarwal, Rishabh, Kazemi, Mehran, Malkin, Dan, Kumar, Ravin, Vilar, David, Brusilovsky, Idan, Luo, Jiaming, Steiner, Andreas, Friesen, Abe, Sharma, Abhanshu, Sharma, Abheesht, Gilady, Adi Mayrav, Goedeckemeyer, Adrian, Saade, Alaa, Feng, Alex, Kolesnikov, Alexander, Bendebury, Alexei, Abdagic, Alvin, Vadi, Amit, György, András, Pinto, André Susano, Das, Anil, Bapna, Ankur, Miech, Antoine, Yang, Antoine, Paterson, Antonia, Shenoy, Ashish, Chakrabarti, Ayan, Piot, Bilal, Wu, Bo, Shahriari, Bobak, Petrini, Bryce, Chen, Charlie, Lan, Charline Le, Choquette-Choo, Christopher A., Carey, CJ, Brick, Cormac, Deutsch, Daniel, Eisenbud, Danielle, Cattle, Dee, Cheng, Derek, Paparas, Dimitris, Sreepathihalli, Divyashree Shivakumar, Reid, Doug, Tran, Dustin, Zelle, Dustin, Noland, Eric, Huizenga, Erwin, Kharitonov, Eugene, Liu, Frederick, Amirkhanyan, Gagik, Cameron, Glenn, Hashemi, Hadi, Klimczak-Plucińska, Hanna, Singh, Harman, Mehta, Harsh, Lehri, Harshal Tushar, Hazimeh, Hussein, Ballantyne, Ian, Szpektor, Idan, Nardini, Ivan, Pouget-Abadie, Jean, Chan, Jetha, Stanton, Joe, Wieting, John, Lai, Jonathan, Orbay, Jordi, Fernandez, Joseph, Newlan, Josh, Ji, Ju-yeong, Singh, Jyotinder, Black, Kat, Yu, Kathy, Hui, Kevin, Vodrahalli, Kiran, Greff, Klaus, Qiu, Linhai, Valentine, Marcella, Coelho, Marina, Ritter, Marvin, Hoffman, Matt, Watson, Matthew, Chaturvedi, Mayank, Moynihan, Michael, Ma, Min, Babar, Nabila, Noy, Natasha, Byrd, Nathan, Roy, Nick, Momchev, Nikola, Chauhan, Nilay, Sachdeva, Noveen, Bunyan, Oskar, Botarda, Pankil, Caron, Paul, Rubenstein, Paul Kishan, Culliton, Phil, Schmid, Philipp, Sessa, Pier Giuseppe, Xu, Pingmei, Stanczyk, Piotr, Tafti, Pouya, Shivanna, Rakesh, Wu, Renjie, Pan, Renke, Rokni, Reza, Willoughby, Rob, Vallu, Rohith, Mullins, Ryan, Jerome, Sammy, Smoot, Sara, Girgin, Sertan, Iqbal, Shariq, Reddy, Shashir, Sheth, Shruti, Põder, Siim, Bhatnagar, Sijal, Panyam, Sindhu Raghuram, Eiger, Sivan, Zhang, Susan, Liu, Tianqi, Yacovone, Trevor, Liechty, Tyler, Kalra, Uday, Evci, Utku, Misra, Vedant, Roseberry, Vincent, Feinberg, Vlad, Kolesnikov, Vlad, Han, Woohyun, Kwon, Woosuk, Chen, Xi, Chow, Yinlam, Zhu, Yuvein, Wei, Zichuan, Egyed, Zoltan, Cotruta, Victor, Giang, Minh, Kirk, Phoebe, Rao, Anand, Black, Kat, Babar, Nabila, Lo, Jessica, Moreira, Erica, Martins, Luiz Gustavo, Sanseviero, Omar, Gonzalez, Lucas, Gleicher, Zach, Warkentin, Tris, Mirrokni, Vahab, Senter, Evan, Collins, Eli, Barral, Joelle, Ghahramani, Zoubin, Hadsell, Raia, Matias, Yossi, Sculley, D., Petrov, Slav, Fiedel, Noah, Shazeer, Noam, Vinyals, Oriol, Dean, Jeff, Hassabis, Demis, Kavukcuoglu, Koray, Farabet, Clement, Buchatskaya, Elena, Alayrac, Jean-Baptiste, Anil, Rohan, Lepikhin, Dmitry, Borgeaud, Sebastian, Bachem, Olivier, Joulin, Armand, Andreev, Alek, Hardin, Cassidy, Dadashi, Robert, Hussenot, Léonard
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, wider language coverage, and longer context of at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers and keeping the span of local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction-finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
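The KV-cache saving from mixing local and global attention layers can be made concrete with a back-of-envelope memory estimate. The Python sketch below is only a rough illustration: the layer count, head dimensions, 5:1 local-to-global ratio and 1024-token window are assumptions for the example, not the published Gemma 3 configuration.

# Back-of-envelope KV-cache sizing; all numbers are illustrative assumptions,
# not the actual Gemma 3 configuration.

def kv_cache_bytes(context_len, n_layers, local_per_global, window,
                   n_kv_heads=8, head_dim=256, bytes_per_val=2):
    """Approximate KV-cache size: each layer stores keys and values of shape
    [cached_tokens, n_kv_heads, head_dim]; local layers cache at most `window`
    tokens, global layers cache the full context."""
    n_local = round(n_layers * local_per_global / (local_per_global + 1))
    n_global = n_layers - n_local
    per_token = 2 * n_kv_heads * head_dim * bytes_per_val  # keys and values
    local_part = n_local * min(context_len, window) * per_token
    global_part = n_global * context_len * per_token
    return local_part + global_part

ctx = 128_000
all_global = kv_cache_bytes(ctx, n_layers=48, local_per_global=0, window=1024)
mostly_local = kv_cache_bytes(ctx, n_layers=48, local_per_global=5, window=1024)
print(f"all-global cache:  {all_global / 1e9:.1f} GB")
print(f"5:1 local:global:  {mostly_local / 1e9:.1f} GB")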
Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Suau, Xavier, Delobelle, Pieter, Metcalf, Katherine, Joulin, Armand, Apostoloff, Nicholas, Zappella, Luca, Rodríguez, Pau
An important issue with Large Language Models (LLMs) is their undesired ability to generate toxic language. In this work, we show that the neurons responsible for toxicity can be determined by their power to discriminate toxic sentences, and that toxic language can be mitigated by reducing their activation levels proportionally to this power. We propose AUROC adaptation (AurA), an intervention that can be applied to any pre-trained LLM to mitigate toxicity. As the intervention is proportional to the ability of each neuron to discriminate toxic content, it is free of any model-dependent hyperparameters. We show that AurA can achieve up to a $2.2\times$ reduction in toxicity with only a $0.72$ perplexity increase. We also show that AurA is effective with models of different scales (from 1.5B to 40B parameters), and that its effectiveness in mitigating toxic language, while preserving common-sense zero-shot abilities, holds across all scales. AurA can be combined with pre-prompting strategies, boosting its average mitigation potential from $1.28\times$ to $2.35\times$. Moreover, AurA can counteract adversarial pre-prompts that maliciously elicit toxic content, making it an effective method for deploying safer and less toxic models.
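As a rough illustration of the kind of intervention described, the sketch below scores each neuron by its AUROC at separating toxic from non-toxic sentences and then dampens the most discriminative neurons at inference time via a forward hook. The specific scaling rule (2·(1 − AUROC) for neurons with AUROC above 0.5) and the hooked module path are assumptions made for the example, not necessarily the exact formula used by AurA.

# Sketch of an AurA-style neuron intervention; the 2*(1 - AUROC) dampening rule
# for neurons with AUROC above 0.5 is an assumption for illustration only.
import torch
from sklearn.metrics import roc_auc_score

def auroc_per_neuron(activations, is_toxic):
    """activations: [n_sentences, n_neurons] (e.g. max-pooled over tokens),
    is_toxic: [n_sentences] binary labels. Returns one AUROC per neuron."""
    return torch.tensor([roc_auc_score(is_toxic, activations[:, j])
                         for j in range(activations.shape[1])])

def make_dampening_hook(auroc):
    """Forward hook that rescales a layer's activations neuron-wise at inference."""
    alpha = torch.where(auroc > 0.5, 2.0 * (1.0 - auroc), torch.ones_like(auroc))
    def hook(module, inputs, output):
        return output * alpha.to(output.device, output.dtype)
    return hook

# Hypothetical usage on one MLP activation module of a pre-trained LLM:
# layer.mlp.act_fn.register_forward_hook(make_dampening_hook(auroc))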
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Vo, Huy V., Khalidov, Vasil, Darcet, Timothée, Moutakanni, Théo, Smetanin, Nikita, Szafraniec, Marc, Touvron, Hugo, Couprie, Camille, Oquab, Maxime, Joulin, Armand, Jégou, Hervé, Labatut, Patrick, Bojanowski, Piotr
Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, which prevents scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building datasets that satisfy all these criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains including web-based images, satellite images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par or better than ones trained on manually curated data. Code is available at https://github.com/facebookresearch/ssl-data-curation.
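A minimal sketch of the clustering-and-balanced-sampling idea is given below, assuming precomputed embeddings. The two-level depth, cluster counts and per-concept budget are illustrative choices for the example rather than the settings used in the paper.

# Two-level k-means followed by balanced sampling; hyperparameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def two_level_clusters(embeddings, k1=1000, k2=100, seed=0):
    """Level 1 clusters all points; level 2 clusters the level-1 centroids,
    so each coarse 'concept' groups several fine-grained clusters."""
    km1 = KMeans(n_clusters=k1, random_state=seed, n_init=10).fit(embeddings)
    km2 = KMeans(n_clusters=k2, random_state=seed, n_init=10).fit(km1.cluster_centers_)
    coarse_of_point = km2.labels_[km1.labels_]   # coarse concept id for each data point
    return km1.labels_, coarse_of_point

def balanced_sample(fine, coarse, budget_per_concept=500, seed=0):
    """Spread the sampling budget uniformly over coarse concepts, and within each
    concept uniformly over its fine clusters."""
    rng = np.random.default_rng(seed)
    chosen = []
    for c in np.unique(coarse):
        idx = np.flatnonzero(coarse == c)
        fine_ids = np.unique(fine[idx])
        per_fine = max(1, budget_per_concept // len(fine_ids))
        for f in fine_ids:
            pool = idx[fine[idx] == f]
            chosen.append(rng.choice(pool, size=min(per_fine, len(pool)), replace=False))
    return np.concatenate(chosen)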
Time Sensitive Knowledge Editing through Efficient Finetuning
Ge, Xiou, Mousavi, Ali, Grave, Edouard, Joulin, Armand, Qian, Kun, Han, Benjamin, Arefiyan, Mostafa, Li, Yunyao
Large Language Models (LLMs) have demonstrated impressive capabilities across different tasks and are bringing transformative changes to many domains. However, keeping the knowledge in LLMs up-to-date remains a challenge once pretraining is complete. It is thus essential to design effective methods to both update obsolete knowledge and inject new knowledge into LLMs. Existing locate-and-edit knowledge editing (KE) methods suffer from two limitations. First, LLMs edited by such methods generally have poor capability in answering complex queries that require multi-hop reasoning. Second, the long run-time of such locate-and-edit methods makes large-scale KE infeasible in practice. In this paper, we explore Parameter-Efficient Fine-Tuning (PEFT) techniques as an alternative for KE. We curate a more comprehensive temporal KE dataset with both knowledge update and knowledge injection examples for KE performance benchmarking. We further probe the effect of fine-tuning on a range of layers in an LLM for the multi-hop QA task. We find that PEFT performs better than locate-and-edit techniques for time-sensitive knowledge edits.
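As one concrete instance of the PEFT family explored here, the sketch below implements a LoRA-style adapter around a frozen linear layer in PyTorch. The rank, scaling and choice of which layers to wrap are illustrative assumptions, not the configuration reported in the paper.

# Minimal LoRA-style adapter as one example of a PEFT method; hyperparameters
# and wrapped layers are assumptions for illustration.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # original weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

# Knowledge edits would then fine-tune only the lora_a / lora_b parameters on
# updated or newly injected fact statements, leaving the rest of the LLM untouched.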
Advancing human-centric AI for robust X-ray analysis through holistic self-supervised learning
Moutakanni, Théo, Bojanowski, Piotr, Chassagnon, Guillaume, Hudelot, Céline, Joulin, Armand, LeCun, Yann, Muckley, Matthew, Oquab, Maxime, Revel, Marie-Pierre, Vakalopoulou, Maria
AI foundation models are gaining traction in various applications, including medical fields like radiology. However, medical foundation models are often tested on limited tasks, leaving their generalisability and biases unexplored. We present RayDINO, a large visual encoder trained by self-supervision on 873k chest X-rays. We compare RayDINO to previous state-of-the-art models across nine radiology tasks, from classification and dense segmentation to text generation, and provide an in-depth analysis of the population, age and sex biases of our model. Our findings suggest that self-supervision enables patient-centric AI that proves useful in clinical workflows and interprets X-rays holistically. With RayDINO and small task-specific adapters, we reach state-of-the-art results and improve generalization to unseen populations while mitigating bias, illustrating the true promise of foundation models: versatility and robustness.
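The frozen-encoder-plus-small-adapter setup mentioned above can be sketched as a simple linear probe. The backbone and the 14-class label space below are placeholders for illustration, not RayDINO itself.

# Frozen self-supervised backbone with a small trainable head; both the backbone
# and the number of classes are placeholders, not the models used in the paper.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int = 14):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False                    # the encoder is never fine-tuned
        self.head = nn.Linear(feat_dim, num_classes)   # the only trainable part

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)                   # [batch, feat_dim] global features
        return self.head(feats)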
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Mesnard, Thomas, Hardin, Cassidy, Dadashi, Robert, Bhupatiraju, Surya, Pathak, Shreya, Sifre, Laurent, Rivière, Morgane, Kale, Mihir Sanjay, Love, Juliette, Tafti, Pouya, Hussenot, Léonard, Sessa, Pier Giuseppe, Chowdhery, Aakanksha, Roberts, Adam, Barua, Aditya, Botev, Alex, Castro-Ros, Alex, Slone, Ambrose, Héliou, Amélie, Tacchetti, Andrea, Bulanova, Anna, Paterson, Antonia, Tsai, Beth, Shahriari, Bobak, Lan, Charline Le, Choquette-Choo, Christopher A., Crepy, Clément, Cer, Daniel, Ippolito, Daphne, Reid, David, Buchatskaya, Elena, Ni, Eric, Noland, Eric, Yan, Geng, Tucker, George, Muraru, George-Christian, Rozhdestvenskiy, Grigory, Michalewski, Henryk, Tenney, Ian, Grishchenko, Ivan, Austin, Jacob, Keeling, James, Labanowski, Jane, Lespiau, Jean-Baptiste, Stanway, Jeff, Brennan, Jenny, Chen, Jeremy, Ferret, Johan, Chiu, Justin, Mao-Jones, Justin, Lee, Katherine, Yu, Kathy, Millican, Katie, Sjoesund, Lars Lowe, Lee, Lisa, Dixon, Lucas, Reid, Machel, Mikuła, Maciej, Wirth, Mateo, Sharman, Michael, Chinaev, Nikolai, Thain, Nithum, Bachem, Olivier, Chang, Oscar, Wahltinez, Oscar, Bailey, Paige, Michel, Paul, Yotov, Petko, Chaabouni, Rahma, Comanescu, Ramona, Jana, Reena, Anil, Rohan, McIlroy, Ross, Liu, Ruibo, Mullins, Ryan, Smith, Samuel L, Borgeaud, Sebastian, Girgin, Sertan, Douglas, Sholto, Pandya, Shree, Shakeri, Siamak, De, Soham, Klimenko, Ted, Hennigan, Tom, Feinberg, Vlad, Stokowiec, Wojciech, Chen, Yu-hui, Ahmed, Zafarali, Gong, Zhitao, Warkentin, Tris, Peran, Ludovic, Giang, Minh, Farabet, Clément, Vinyals, Oriol, Dean, Jeff, Kavukcuoglu, Koray, Hassabis, Demis, Ghahramani, Zoubin, Eck, Douglas, Barral, Joelle, Pereira, Fernando, Collins, Eli, Joulin, Armand, Fiedel, Noah, Senter, Evan, Andreev, Alek, Kenealy, Kathleen
This work introduces Gemma, a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Botev, Aleksandar, De, Soham, Smith, Samuel L, Fernando, Anushan, Muraru, George-Cristian, Haroun, Ruba, Berrada, Leonard, Pascanu, Razvan, Sessa, Pier Giuseppe, Dadashi, Robert, Hussenot, Léonard, Ferret, Johan, Girgin, Sertan, Bachem, Olivier, Andreev, Alek, Kenealy, Kathleen, Mesnard, Thomas, Hardin, Cassidy, Bhupatiraju, Surya, Pathak, Shreya, Sifre, Laurent, Rivière, Morgane, Kale, Mihir Sanjay, Love, Juliette, Tafti, Pouya, Joulin, Armand, Fiedel, Noah, Senter, Evan, Chen, Yutian, Srinivasan, Srivatsan, Desjardins, Guillaume, Budden, David, Doucet, Arnaud, Vikram, Sharad, Paszke, Adam, Gale, Trevor, Borgeaud, Sebastian, Chen, Charlie, Brock, Andy, Paterson, Antonia, Brennan, Jenny, Risdal, Meg, Gundluru, Raj, Devanathan, Nesh, Mooney, Paul, Chauhan, Nilay, Culliton, Phil, Martins, Luiz Gustavo, Bandy, Elisa, Huntsperger, David, Cameron, Glenn, Zucker, Arthur, Warkentin, Tris, Peran, Ludovic, Giang, Minh, Ghahramani, Zoubin, Farabet, Clément, Kavukcuoglu, Koray, Hassabis, Demis, Hadsell, Raia, Teh, Yee Whye, de Freitas, Nando
We introduce RecurrentGemma, an open language model which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language tasks. It has a fixed-size state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction-tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.
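The fixed-size-state property can be illustrated with a toy gated linear recurrence: whatever the sequence length, the only carried memory is a single vector per example. The sketch below is a simplified stand-in for illustration, not the actual Griffin block.

# Toy gated linear recurrence showing a state whose size does not grow with
# sequence length (unlike a KV cache); this is not the real Griffin recurrence.
import torch
import torch.nn as nn

class GatedLinearRecurrence(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_gate = nn.Linear(dim, dim)
        self.to_input = nn.Linear(dim, dim)

    def forward(self, x):                       # x: [batch, seq, dim]
        state = torch.zeros(x.size(0), x.size(2), device=x.device)  # fixed-size state
        outs = []
        for t in range(x.size(1)):
            a = torch.sigmoid(self.to_gate(x[:, t]))                # per-channel decay in (0, 1)
            state = a * state + (1 - a) * self.to_input(x[:, t])
            outs.append(state)
        return torch.stack(outs, dim=1)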
The effectiveness of MAE pre-pretraining for billion-scale pretraining
Singh, Mannat, Duval, Quentin, Alwala, Kalyan Vasudev, Fan, Haoqi, Aggarwal, Vaibhav, Adcock, Aaron, Joulin, Armand, Dollár, Piotr, Feichtenhofer, Christoph, Girshick, Ross, Girdhar, Rohit, Misra, Ishan
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large-scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has previously only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size, making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters) and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.7%), ImageNet-ReaL (91.1%), 1-shot ImageNet-1k (63.6%), and zero-shot transfer on Food-101 (96.2%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images, and our models are available publicly.
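To make the pre-pretraining objective concrete, the sketch below shows a simplified masked-reconstruction loss in the spirit of MAE. Real MAE feeds only the visible patches to the encoder, whereas this toy version zeroes out the hidden ones, and the tiny model is a placeholder rather than anything used in the paper.

# Simplified masked-reconstruction loss: hide most patches, reconstruct them,
# and only score the hidden ones. A toy variant of MAE, for illustration.
import torch
import torch.nn as nn

def masked_reconstruction_loss(patches, model, mask_ratio=0.75):
    """patches: [batch, num_patches, patch_dim]; model maps that shape to itself."""
    batch, n, d = patches.shape
    mask = torch.rand(batch, n, device=patches.device) < mask_ratio   # True = hidden
    visible = patches * (~mask).unsqueeze(-1)                         # zero out hidden patches
    recon = model(visible)
    per_patch = ((recon - patches) ** 2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

# Pre-pretraining would optimize this loss first; the resulting weights then
# initialize the (weakly) supervised pretraining stage, followed by finetuning.
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))
loss = masked_reconstruction_loss(torch.randn(4, 196, 768), model)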
PaSS: Parallel Speculative Sampling
Monea, Giovanni, Joulin, Armand, Grave, Edouard
Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation time, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory. This memory access forms the primary bottleneck for generation, and it worsens as the model size increases. Moreover, executing a forward pass for multiple tokens in parallel often takes nearly the same time as it does for just one token. These two observations led to the development of speculative sampling, where a second, smaller model is used to draft a few tokens that are then validated or rejected using a single forward pass of the large model. Unfortunately, this method requires two models that share the same tokenizer, which limits its adoption. As an alternative, we propose to use parallel decoding as a way to draft multiple tokens from a single model, with no computational cost and no need for a second model. Our approach only requires an additional input token that marks the words that will be generated simultaneously. We show promising performance (up to $30\%$ speed-up) while requiring only as few as $O(d_{emb})$ additional parameters.
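A simplified draft-then-verify step is sketched below: the drafted tokens (which the look-ahead mechanism would produce in a single forward pass) are checked against the model's own predictions, and the longest agreeing prefix is accepted. Greedy decoding and this acceptance rule are simplifying assumptions for illustration, not the paper's exact procedure.

# Draft-then-verify step of speculative decoding, simplified to greedy acceptance.
import torch

def verify_step(model, prefix, drafted):
    """prefix: [1, t] accepted tokens; drafted: [k] candidate tokens.
    One forward pass over prefix + drafted scores every draft position at once."""
    seq = torch.cat([prefix, drafted.unsqueeze(0)], dim=1)
    logits = model(seq)                              # [1, t + k, vocab]
    preds = logits.argmax(dim=-1)[0]                 # model's next-token choice at each position
    accepted = []
    for i, tok in enumerate(drafted):
        if preds[prefix.size(1) - 1 + i] == tok:     # draft agrees with the model: keep it
            accepted.append(tok)
        else:                                        # first mismatch: take the model's token, stop
            accepted.append(preds[prefix.size(1) - 1 + i])
            break
    return torch.stack(accepted)

# Usage with a toy "model" (random logits) purely to exercise the shapes:
toy = lambda seq: torch.randn(1, seq.size(1), 32_000)
print(verify_step(toy, prefix=torch.tensor([[1, 2, 3]]), drafted=torch.tensor([7, 8, 9])))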
ImageBind: One Embedding Space To Bind Them All
Girdhar, Rohit, El-Nouby, Alaaeldin, Liu, Zhuang, Singh, Mannat, Alwala, Kalyan Vasudev, Joulin, Armand, Misra, Ishan
We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state of the art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
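The image-centred alignment can be sketched as a standard two-way InfoNCE loss applied only to (image, other-modality) pairs; the encoders, embedding dimension and temperature below are placeholders for illustration.

# Two-way InfoNCE between image embeddings and one other modality; training would
# sum this over (image, audio), (image, depth), (image, IMU), ... pairs, so the
# non-image modalities align with each other only through the shared image anchor.
import torch
import torch.nn.functional as F

def infonce(image_emb, other_emb, temperature=0.07):
    """image_emb, other_emb: [batch, dim] paired embeddings (row i matches row i)."""
    img = F.normalize(image_emb, dim=-1)
    oth = F.normalize(other_emb, dim=-1)
    logits = img @ oth.T / temperature                 # [batch, batch] similarity matrix
    targets = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Shapes-only usage example with random embeddings:
loss = infonce(torch.randn(8, 512), torch.randn(8, 512))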