AITopics | Liskovich, Diana

Liskovich, Diana

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

Elhoushi, Mostafa, Shrivastava, Akshat, Liskovich, Diana, Hosmer, Basil, Wasti, Bram, Lai, Liangzhen, Mahmoud, Anas, Acun, Bilge, Agarwal, Saurabh, Roman, Ahmed, Aly, Ahmed A, Chen, Beidi, Wu, Carole-Jean

arXiv.org Artificial Intelligence, Apr-29-2024

We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with the remaining layers of the model. Our proposed self-speculative decoding approach has a smaller memory footprint than other speculative decoding approaches and benefits from shared compute and activations of the draft and verification stages. We run experiments on different Llama model sizes with different types of training: pretraining from scratch, continual pretraining, finetuning on a specific data domain, and finetuning on a specific task. We implement our inference solution and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x on coding, and 2.0x on the TOPv2 semantic parsing task.
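The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `self_speculative_decode`, `draft_next`, `verify_next`, and `draft_len` are hypothetical, decoding is assumed to be greedy, and the two stages are abstracted as plain callables, so the shared compute and activations between the early-exit draft and the remaining layers are not modeled.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: maps the tokens so far to the next greedy token.
NextToken = Callable[[List[Token]], Token]


def self_speculative_decode(
    prompt: List[Token],
    draft_next: NextToken,   # stands in for an early-exit prediction
    verify_next: NextToken,  # stands in for the full-depth prediction
    num_new_tokens: int,
    draft_len: int = 4,
) -> List[Token]:
    """Greedy draft-then-verify decoding (toy, sequential verification)."""
    tokens = list(prompt)
    target = len(prompt) + num_new_tokens
    while len(tokens) < target:
        # Draft stage: propose up to draft_len tokens with the cheap predictor.
        k = min(draft_len, target - len(tokens))
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # Verify stage: keep draft tokens while the full predictor agrees;
        # at the first mismatch, substitute the full predictor's token and stop.
        accepted: List[Token] = []
        for tok in draft:
            correct = verify_next(tokens + accepted)
            accepted.append(correct)
            if correct != tok:
                break
        tokens.extend(accepted)
    return tokens


def greedy_decode(prompt: List[Token], next_token: NextToken, n: int) -> List[Token]:
    """Reference: plain greedy decoding with a single predictor."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_token(tokens))
    return tokens
```

Under these assumptions the output always matches plain greedy decoding with `verify_next`, because every emitted token is one the full predictor would have chosen; any speedup in a real system comes from verifying the drafted tokens in a single batched pass rather than one at a time as in this toy loop.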

large language model, machine learning, natural language

arXiv.org Artificial Intelligence

2404.16710

Country:

North America > United States (1.00)
Europe (1.00)
Africa > Middle East > Egypt (0.99)

Genre: Research Report (0.64)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.67)

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, Bikel, Dan, Blecher, Lukas, Ferrer, Cristian Canton, Chen, Moya, Cucurull, Guillem, Esiobu, David, Fernandes, Jude, Fu, Jeremy, Fu, Wenyin, Fuller, Brian, Gao, Cynthia, Goswami, Vedanuj, Goyal, Naman, Hartshorn, Anthony, Hosseini, Saghar, Hou, Rui, Inan, Hakan, Kardas, Marcin, Kerkez, Viktor, Khabsa, Madian, Kloumann, Isabel, Korenev, Artem, Koura, Punit Singh, Lachaux, Marie-Anne, Lavril, Thibaut, Lee, Jenya, Liskovich, Diana, Lu, Yinghai, Mao, Yuning, Martinet, Xavier, Mihaylov, Todor, Mishra, Pushkar, Molybog, Igor, Nie, Yixin, Poulton, Andrew, Reizenstein, Jeremy, Rungta, Rashi, Saladi, Kalyan, Schelten, Alan, Silva, Ruan, Smith, Eric Michael, Subramanian, Ranjan, Tan, Xiaoqing Ellen, Tang, Binh, Taylor, Ross, Williams, Adina, Kuan, Jian Xiang, Xu, Puxin, Yan, Zheng, Zarov, Iliyan, Zhang, Yuchen, Fan, Angela, Kambadur, Melanie, Narang, Sharan, Rodriguez, Aurelien, Stojnic, Robert, Edunov, Sergey, Scialom, Thomas

arXiv.org Artificial Intelligence, Jul-19-2023

In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

machine learning, natural language, reinforcement learning

arXiv.org Artificial Intelligence

2307.09288

Country:

North America > United States (1.00)
Asia > Middle East > UAE (0.13)

Genre: Research Report > New Finding (1.00)

Industry:

Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Theory on Adam Instability in Large-Scale Machine Learning

Molybog, Igor, Albert, Peter, Chen, Moya, DeVito, Zachary, Esiobu, David, Goyal, Naman, Koura, Punit Singh, Narang, Sharan, Poulton, Andrew, Silva, Ruan, Tang, Binh, Liskovich, Diana, Xu, Puxin, Zhang, Yuchen, Kambadur, Melanie, Roller, Stephen, Zhang, Susan

arXiv.org Artificial Intelligence, Apr-25-2023

The training instability reported by Chowdhery et al. [2022] is an interesting phenomenon that has only been reported for large language models trained on the order of a trillion tokens, and it poses a threat to further scaling of AI systems. Chowdhery et al. [2022] observed dozens of spikes in the loss curve throughout training. To mitigate the issue, they restarted training from a checkpoint roughly 100 steps before the spike started and skipped roughly 200-500 data batches, in order to exclude batches that were seen right before and during the spike. In that case, the spike in the loss value did not repeat. The spikes were also not observed when the skipped data was fed through the model again after this mitigation, which implies that the spike was caused not by the data itself but by an interaction between the data batch and the state of the model training run. The purpose of this work is to rigorously reproduce the experiment with a different hardware and software setup, to explain the observed behavior with empirical evidence and theoretical arguments, and to propose alternative ways of mitigating the issue. Loss spikes are difficult to study because any reproduction of these spikes at a smaller scale is not necessarily caused by or remediated by the same factors as at larger scales. We therefore analyze large-scale language modeling experiments, training four models between 7 billion and 546 billion parameters. The models are decoder-only transformers [Brown et al., 2020, Smith et al., 2022] with different depths and embedding dimensions, trained using the AdamW [Loshchilov and Hutter, 2017] algorithm with a linear learning rate schedule.
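The rewind-and-skip mitigation described above can be written down as a small planning step. The sketch below is illustrative only and is not taken from either paper: `plan_restart`, `checkpoints`, `rewind`, and `skip` are hypothetical names, and the defaults (100 steps, 300 batches) are simply values inside the rough ranges quoted in the abstract. It also assumes one data batch per training step.

```python
from typing import List, Tuple


def plan_restart(
    spike_step: int,
    checkpoints: List[int],  # training steps at which checkpoints were saved
    rewind: int = 100,       # assumed: restart at least this many steps before the spike
    skip: int = 300,         # assumed: number of data batches to skip after restarting
) -> Tuple[int, range]:
    """Choose a restart checkpoint and the batch indices to skip.

    Returns the latest checkpoint that is at least `rewind` steps before
    the spike, plus the range of batch indices excluded from the rerun.
    """
    eligible = [c for c in checkpoints if c <= spike_step - rewind]
    if not eligible:
        raise ValueError("no checkpoint early enough before the spike")
    restart_step = max(eligible)
    # Skip the batches that would have been seen just before and during the spike.
    skipped_batches = range(restart_step, restart_step + skip)
    return restart_step, skipped_batches
```

For a spike at step 1000 with checkpoints at steps 0, 500, 850, and 950, this picks step 850 (the checkpoint at 950 is too close to the spike) and skips batches 850 through 1149, which covers the batch seen at the spike itself.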

artificial intelligence, machine learning, natural language

arXiv.org Artificial Intelligence

2304.09871

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
