Donchev, Yani
Scaling Instructable Agents Across Many Simulated Worlds
SIMA Team, null, Raad, Maria Abi, Ahuja, Arun, Barros, Catarina, Besse, Frederic, Bolt, Andrew, Bolton, Adrian, Brownfield, Bethanie, Buttimore, Gavin, Cant, Max, Chakera, Sarah, Chan, Stephanie C. Y., Clune, Jeff, Collister, Adrian, Copeman, Vikki, Cullum, Alex, Dasgupta, Ishita, de Cesare, Dario, Di Trapani, Julia, Donchev, Yani, Dunleavy, Emma, Engelcke, Martin, Faulkner, Ryan, Garcia, Frankie, Gbadamosi, Charles, Gong, Zhitao, Gonzales, Lucy, Gupta, Kshitij, Gregor, Karol, Hallingstad, Arne Olav, Harley, Tim, Haves, Sam, Hill, Felix, Hirst, Ed, Hudson, Drew A., Hudson, Jony, Hughes-Fitt, Steph, Rezende, Danilo J., Jasarevic, Mimi, Kampis, Laura, Ke, Rosemary, Keck, Thomas, Kim, Junkyung, Knagg, Oscar, Kopparapu, Kavya, Lampinen, Andrew, Legg, Shane, Lerchner, Alexander, Limont, Marjorie, Liu, Yulan, Loks-Thompson, Maria, Marino, Joseph, Cussons, Kathryn Martin, Matthey, Loic, Mcloughlin, Siobhan, Mendolicchio, Piermaria, Merzic, Hamza, Mitenkova, Anna, Moufarek, Alexandre, Oliveira, Valeria, Oliveira, Yanko, Openshaw, Hannah, Pan, Renke, Pappu, Aneesh, Platonov, Alex, Purkiss, Ollie, Reichert, David, Reid, John, Richemond, Pierre Harvey, Roberts, Tyson, Ruscoe, Giles, Elias, Jaume Sanchez, Sandars, Tasha, Sawyer, Daniel P., Scholtes, Tim, Simmons, Guy, Slater, Daniel, Soyer, Hubert, Strathmann, Heiko, Stys, Peter, Tam, Allison C., Teplyashin, Denis, Terzi, Tayfun, Vercelli, Davide, Vujatovic, Bojan, Wainwright, Marcus, Wang, Jane X., Wang, Zhengdong, Wierstra, Daan, Williams, Duncan, Wong, Nathaniel, York, Sarah, Young, Nick
Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Accomplishing this goal requires learning to ground language in perception and embodied actions, in order to accomplish complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environments as well as open-ended, commercial video games. Our goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Our approach focuses on language-driven generality while imposing minimal assumptions. Our agents interact with environments in real-time using a generic, human-like interface: the inputs are image observations and language instructions and the outputs are keyboard-and-mouse actions. This general approach is challenging, but it allows agents to ground language across many visually complex and semantically rich environments while also allowing us to readily run agents in new environments. In this paper we describe our motivation and goal, the initial progress we have made, and promising preliminary results on several diverse research environments and a variety of commercial video games.
DiPaCo: Distributed Path Composition
Douillard, Arthur, Feng, Qixuan, Rusu, Andrei A., Kuncoro, Adhiguna, Donchev, Yani, Chhaparia, Rachita, Gog, Ionel, Ranzato, Marc'Aurelio, Shen, Jiajun, Szlam, Arthur
Progress in machine learning (ML) has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high bandwidth communication between devices working in parallel. In this work, we propose a co-designed modular architecture and training approach for ML models, dubbed DIstributed PAth COmposition (DiPaCo). During training, DiPaCo distributes computation by paths through a set of shared modules. Together with a Local-SGD inspired optimization (DiLoCo) that keeps modules in sync with drastically reduced communication, Our approach facilitates training across poorly connected and heterogeneous workers, with a design that ensures robustness to worker failures and preemptions. At inference time, only a single path needs to be executed for each input, without the need for any model compression. We consider this approach as a first prototype towards a new paradigm of large-scale learning, one that is less synchronous and more modular. Our experiments on the widely used C4 benchmark show that, for the same amount of training steps but less wall-clock time, DiPaCo exceeds the performance of a 1 billion-parameter dense transformer language model by choosing one of 256 possible paths, each with a size of 150 million parameters.
DiLoCo: Distributed Low-Communication Training of Language Models
Douillard, Arthur, Feng, Qixuan, Rusu, Andrei A., Chhaparia, Rachita, Donchev, Yani, Kuncoro, Adhiguna, Ranzato, Marc'Aurelio, Szlam, Arthur, Shen, Jiajun
Large language models (LLM) have become a critical component in many applications of machine learning. However, standard approaches to training LLM require a large number of tightly interconnected accelerators, with devices exchanging gradients and other intermediate states at each optimization step. While it is difficult to build and maintain a single computing cluster hosting many accelerators, it might be easier to find several computing clusters each hosting a smaller number of devices. In this work, we propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of language models on islands of devices that are poorly connected. The approach is a variant of federated averaging, where the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum. On the widely used C4 dataset, we show that DiLoCo on 8 workers performs as well as fully synchronous optimization while communicating 500 times less. DiLoCo exhibits great robustness to the data distribution of each worker. It is also robust to resources becoming unavailable over time, and vice versa, it can seamlessly leverage resources that become available during training.