Liskovich, Diana
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Elhoushi, Mostafa, Shrivastava, Akshat, Liskovich, Diana, Hosmer, Basil, Wasti, Bram, Lai, Liangzhen, Mahmoud, Anas, Acun, Bilge, Agarwal, Saurabh, Roman, Ahmed, Aly, Ahmed A, Chen, Beidi, Wu, Carole-Jean
We present LayerSkip, an end-to-end solution to speed-up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with remaining layers of the model. Our proposed self-speculative decoding approach has less memory footprint than other speculative decoding approaches and benefits from shared compute and activations of the draft and verification stages. We run experiments on different Llama model sizes on different types of training: pretraining from scratch, continual pretraining, finetuning on specific data domain, and finetuning on specific task. We implement our inference solution and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x on coding, and 2.0x on TOPv2 semantic parsing task.
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, Bikel, Dan, Blecher, Lukas, Ferrer, Cristian Canton, Chen, Moya, Cucurull, Guillem, Esiobu, David, Fernandes, Jude, Fu, Jeremy, Fu, Wenyin, Fuller, Brian, Gao, Cynthia, Goswami, Vedanuj, Goyal, Naman, Hartshorn, Anthony, Hosseini, Saghar, Hou, Rui, Inan, Hakan, Kardas, Marcin, Kerkez, Viktor, Khabsa, Madian, Kloumann, Isabel, Korenev, Artem, Koura, Punit Singh, Lachaux, Marie-Anne, Lavril, Thibaut, Lee, Jenya, Liskovich, Diana, Lu, Yinghai, Mao, Yuning, Martinet, Xavier, Mihaylov, Todor, Mishra, Pushkar, Molybog, Igor, Nie, Yixin, Poulton, Andrew, Reizenstein, Jeremy, Rungta, Rashi, Saladi, Kalyan, Schelten, Alan, Silva, Ruan, Smith, Eric Michael, Subramanian, Ranjan, Tan, Xiaoqing Ellen, Tang, Binh, Taylor, Ross, Williams, Adina, Kuan, Jian Xiang, Xu, Puxin, Yan, Zheng, Zarov, Iliyan, Zhang, Yuchen, Fan, Angela, Kambadur, Melanie, Narang, Sharan, Rodriguez, Aurelien, Stojnic, Robert, Edunov, Sergey, Scialom, Thomas
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
A Theory on Adam Instability in Large-Scale Machine Learning
Molybog, Igor, Albert, Peter, Chen, Moya, DeVito, Zachary, Esiobu, David, Goyal, Naman, Koura, Punit Singh, Narang, Sharan, Poulton, Andrew, Silva, Ruan, Tang, Binh, Liskovich, Diana, Xu, Puxin, Zhang, Yuchen, Kambadur, Melanie, Roller, Stephen, Zhang, Susan
Training instability reported by Chowdhery et al. [2022] is an interesting phenomenon that has only been reported for the large language models trained on an order of a trillion tokens, posing a threat to further scaling of the AI systems. Chowdhery et al. [2022] have observed dozens of spikes in the loss curve throughout training. To mitigate the issue, they re-started training from a checkpoint roughly 100 steps before the spike started, and skipped roughly 200-500 data batches, in order to exclude batches that were seen right before and during the spike. In that case, the spike of the loss value did not repeat. The spikes were also not observed when the skipped data was fed through the model again after the aforementioned mitigation, which implies that the data itself did not cause the spike, but rather an interference of the data batch with the state of the model training run. The purpose of this work is to rigorously reproduce the experiment with a different hardware and software setup, come up with an explanation for the observed behavior supported by empirical evidence and theoretical arguments, and propose alternative ways of mitigating the issue. Loss spikes are difficult to study because any reproduction of these spikes at a smaller scale is not necessarily caused by or remediated by the same factors as in larger scales. We therefore analyze large-scale language modeling experiments, training four models between 7 billion and 546 billion parameters. The models are decoder-only transformers [Brown et al., 2020, Smith et al., 2022] with different depth and embedding dimensions and trained using the AdamW [Loshchilov and Hutter, 2017] algorithm with a linear learning rate schedule.