 gpt2 model


NNTile: a machine learning framework capable of training extremely large GPT language models on a single node

Mikhalev, Aleksandr, Katrutsa, Aleksandr, Sozykin, Konstantin, Oseledets, Ivan

arXiv.org Artificial Intelligence

This study presents NNTile, a framework for training large deep neural networks in heterogeneous clusters. NNTile is built on the StarPU library, which implements task-based parallelism and schedules all submitted tasks onto all available processing units (CPUs and GPUs). As a result, any operation needed to train a large neural network can run on any CPU core or GPU device, depending on automatic scheduling decisions. This approach shifts the burden of deciding where to compute and when to communicate from a human to an automatic decision maker, whether a simple greedy heuristic or complex AI-based software. The performance of the tool for training large language models is demonstrated in extensive numerical experiments.


Language Models Grow Less Humanlike beyond Phase Transition

Aoyama, Tatsuya, Wilcox, Ethan

arXiv.org Artificial Intelligence

LMs' alignment with human reading behavior (i.e. psychometric predictive power; PPP) is known to improve during pretraining up to a tipping point, beyond which it either plateaus or degrades. Various factors, such as word frequency, recency bias in attention, and context size, have been theorized to affect PPP, yet there is no current account that explains why such a tipping point exists, and how it interacts with LMs' pretraining dynamics more generally. We hypothesize that the underlying factor is a pretraining phase transition, characterized by the rapid emergence of specialized attention heads. We conduct a series of correlational and causal experiments to show that such a phase transition is responsible for the tipping point in PPP. We then show that, rather than producing attention patterns that contribute to the degradation in PPP, phase transitions alter the subsequent learning dynamics of the model, such that further training keeps damaging PPP.


From Attention to Activation: Unravelling the Enigmas of Large Language Models

Kaul, Prannay, Ma, Chengcheng, Elezi, Ismail, Deng, Jiankang

arXiv.org Artificial Intelligence

We study two strange phenomena in auto-regressive Transformers: (1) the dominance of the first token in attention heads; (2) the occurrence of large outlier activations in the hidden states. We find that popular large language models, such as Llama, attend maximally to the first token in 98% of attention heads, a behaviour we attribute to the softmax function. To mitigate this issue, we propose a reformulation of softmax to softmax-1. Furthermore, we identify adaptive optimisers, e.g. Adam, as the primary contributor to the large outlier activations and introduce OrthoAdam, a novel optimiser that utilises orthogonal matrices to transform gradients, to address this issue. Finally, our methods not only prevent these phenomena from occurring but also enable Transformers to sustain their performance when quantised using basic algorithms, something that standard methods are unable to do. In summary, our methods reduce the attention proportion on the first token from 65% to 3.3%, the activation kurtosis in the hidden states from 1657 to 3.1, and the perplexity penalty under 4-bit weight quantisation from 3565 to 0.3.
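The abstract does not spell out the softmax-1 formula. The sketch below assumes it denotes the commonly described variant with an extra 1 in the denominator, exp(x_i) / (1 + sum_j exp(x_j)), which lets an attention head assign less than full total weight instead of being forced to put its mass somewhere. The function names and the example scores are illustrative, not taken from the paper.

```python
import math


def softmax(scores):
    """Standard softmax: the outputs always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_1(scores):
    """Softmax with an extra 1 in the denominator (assumed form of softmax-1).

    Equivalent to appending a fixed logit of 0 and discarding its
    probability, so the outputs may sum to less than 1.
    """
    # Shift by the largest logit, counting the implicit zero logit,
    # so that no exponential can overflow.
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores where no key is a good match.
scores = [-4.0, -5.0, -6.0]
print("softmax sum:  ", sum(softmax(scores)))
print("softmax-1 sum:", sum(softmax_1(scores)))
```

With uniformly low scores, the standard softmax still distributes a total weight of 1, while the assumed softmax-1 form lets the total shrink toward 0.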


Evolving Subnetwork Training for Large Language Models

Li, Hanqi, Chen, Lu, Ma, Da, Wu, Zijian, Zhu, Su, Yu, Kai

arXiv.org Artificial Intelligence

Large language models have ushered in a new era of artificial intelligence research. However, their substantial training costs hinder further development and widespread adoption. In this paper, inspired by the redundancy in the parameters of large language models, we propose a novel training paradigm: Evolving Subnetwork Training (EST). EST samples subnetworks from the layers of the large language model and from commonly used modules within each layer, Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP). By gradually increasing the size of the subnetworks during the training process, EST can save the cost of training. We apply EST to train the GPT2 and TinyLlama models, resulting in 26.7% FLOPs saving for GPT2 and 25.0% for TinyLlama without an increase in loss on the pre-training dataset. Moreover, EST leads to performance improvements in downstream tasks, indicating that it benefits generalization. Additionally, we provide intuitive theoretical studies based on training dynamics and Dropout theory to ensure the feasibility of EST. Our code is available at https://github.com/OpenDFM/EST.


Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions

Li, Ruizhe, Gao, Yanjun

arXiv.org Artificial Intelligence

Large Language Models (LLMs), such as the GPT-4 and LLaMA families, have demonstrated considerable success across diverse tasks, including multiple-choice questions (MCQs). However, these models exhibit a positional bias, particularly an even worse "anchored bias" in the GPT-2 family, where they consistently favour the first choice 'A' in MCQs during inference. This anchored bias challenges the integrity of GPT-2's decision-making process, as it skews performance based on the position rather than the content of the choices in MCQs. In this study, we utilise the mechanistic interpretability approach to identify the internal modules within GPT-2 models responsible for this bias. We focus on the Multi-Layer Perceptron (MLP) layers and attention heads, using the "logit lens" method to trace and modify the specific value vectors that contribute to the bias. By updating these vectors within MLP and recalibrating attention patterns to neutralise the preference for the first choice 'A', we effectively mitigate the anchored bias. Our interventions not only mitigate the bias but also improve the overall MCQ prediction accuracy for the GPT-2 family across various datasets. This work represents the first comprehensive mechanistic analysis of anchored bias in MCQs within the GPT-2 models, introducing targeted, minimal-intervention strategies that significantly enhance GPT2 model robustness and accuracy in MCQs.
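For readers unfamiliar with the "logit lens" mentioned above, the general idea is to take an intermediate hidden state, apply the final normalisation, and project it through the unembedding matrix to see which token that layer already favours. The toy sketch below illustrates that idea only. The three-token vocabulary, the identity-like unembedding rows, and the hidden states are made-up values, and the layer norm omits the learned gain and bias a real GPT-2 would use. It is not the authors' procedure.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def logit_lens(hidden, unembed):
    """Project one hidden state onto the vocabulary.

    `unembed` holds one weight row per vocabulary token.
    """
    normed = layer_norm(hidden)
    return [sum(h * w for h, w in zip(normed, row)) for row in unembed]


def top_token(hidden, unembed, vocab):
    """Return the vocabulary item with the largest logit-lens score."""
    logits = logit_lens(hidden, unembed)
    best = max(range(len(vocab)), key=lambda i: logits[i])
    return vocab[best]


# Hypothetical toy setup: 3 answer tokens and 3-dimensional hidden states.
vocab = ["A", "B", "C"]
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hidden_per_layer = [[2.0, 0.0, -2.0], [0.5, 1.5, -2.0]]

for layer, hidden in enumerate(hidden_per_layer):
    print("layer", layer, "favours", top_token(hidden, unembed, vocab))
```

Tracking how the favoured token changes across layers is what makes it possible to locate where a preference for a particular choice enters the computation.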


An Ensemble Approach to Personalized Real Time Predictive Writing for Experts

Prosad, Sourav, Polavarapu, Viswa Datha, Harsola, Shrutendra

arXiv.org Artificial Intelligence

Completing a sentence, phrase, or word after typing a few words or characters is very helpful for Intuit financial experts taking notes or chatting live with users, since they need to write about complex financial concepts efficiently and accurately many times a day. In this paper, we tie together different approaches, including large language models, traditional Markov models, and character-level models, to create an end-to-end system that provides personalised sentence and word auto-complete suggestions to experts under strict latency constraints. The proposed system can auto-complete sentences, phrases, or words during writing, with personalisation, and can be trained efficiently with very little data and few resources. The system is not only efficient and personalised but also robust, as it leverages multiple machine learning techniques along with a transfer learning approach to fine-tune a large language model on Intuit-specific data. This ensures that even for rare or unusual phrases, the system can provide relevant auto-complete suggestions in near real time. A survey has shown that this system saves expert note-taking time and boosts experts' confidence in their communication with teammates and clients. Since this predictive writing feature was enabled for QBLive experts, more than a million keystrokes have been saved based on these suggestions. We also present a comparative study supporting our choice of ensemble. Moreover, this feature can be integrated within a very short time into any product that has a writing facility.


Opinion Mining Using Population-tuned Generative Language Models

Susaiyah, Allmin, Pandya, Abhinay, Härmä, Aki

arXiv.org Artificial Intelligence

We present a novel method for mining opinions from text collections using generative language models trained on data collected from different populations. We describe the basic definitions, methodology and a generic algorithm for opinion insight mining. We demonstrate the performance of our method in an experiment where a pre-trained generative model is fine-tuned using specifically tailored content with unnatural and fully annotated opinions. We show that our approach can learn and transfer the opinions to the semantic classes while maintaining the proportion of polarisation. Finally, we demonstrate the usage of an insight mining system to scale up the discovery of opinion insights from a real text corpus.


Boosting Punctuation Restoration with Data Generation and Reinforcement Learning

Lai, Viet Dac, Salinas, Abel, Tan, Hao, Bui, Trung, Tran, Quan, Yoon, Seunghyun, Deilamsalehy, Hanieh, Dernoncourt, Franck, Nguyen, Thien Huu

arXiv.org Artificial Intelligence

Punctuation restoration is an important task in automatic speech recognition (ASR) which aims to restore the syntactic structure of generated ASR texts to improve readability. While punctuated texts are abundant in written documents, the discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts. This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap. The experiments show that our method achieves state-of-the-art performance on the ASR test set on two benchmark datasets for punctuation restoration.


Test-Time Training on Nearest Neighbors for Large Language Models

Hardt, Moritz, Sun, Yu

arXiv.org Artificial Intelligence

Many recent efforts aim to augment language models with relevant information retrieved from a database at test time. We avoid the need for prompt engineering by directly fine-tuning the model on data retrieved at test time using its standard training setup. For this purpose, we build a large-scale distributed nearest neighbor index based on text embeddings of the Pile dataset. Given a query to a language model, our system retrieves the neighbors of the query and fine-tunes the model on the text data corresponding to those neighbors. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than twenty language modeling tasks in the Pile benchmark. For example, test-time training significantly narrows the performance gap between a small GPT2 model and a GPTNeo model, more than ten times larger, that was specifically trained to convergence on the Pile. Sufficient index quality and size, however, are important. Our work establishes a valuable first baseline for implementing test-time training in the context of large language models, opening the door to numerous promising research avenues.