Carlsson, Fredrik
The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation
Carlsson, Fredrik, Liu, Fangyu, Ward, Daniel, Kurfali, Murathan, Nivre, Joakim
This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating with greedy decoding. This issue persists even with state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that further fine-tuning these models to achieve a near-zero training loss on a small set of samples - a process we refer to as hyperfitting - greatly enhances their long-sequence generative capabilities. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find this phenomenon to be distinctly different from Grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.

Despite the recent rapid advancements in artificial intelligence spearheaded by Transformer-based large language models (LLMs) and their emergent phenomena (Wei et al., 2022b; Bubeck et al., 2023), models trained on next-token pre-training objectives often degenerate when producing longer texts. This is particularly true for greedy decoding, and has resulted in mitigation strategies such as repetition penalties (Keskar et al., 2019) and nucleus sampling (Holtzman et al., 2020). However, when removing these heuristics and simply picking the top-1 candidate at each time step, LLMs display strong tendencies to repeat themselves at the token, phrase, and sentence level (Holtzman et al., 2020), as exemplified in Figure 1 (color indicating how repetitive the generated text is). This is a recurrent phenomenon for which there are many proposed hypotheses, but, to the best of our knowledge, no definitive explanation exists. Although hyperfitted models achieve significantly worse validation loss, they produce texts that align markedly better with human preferences and automatic diversity metrics. Indeed, we find that hyperfitting state-of-the-art LLMs yields capabilities that outperform models with 10x the number of parameters.
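The hyperfitting recipe described above is simple enough to sketch in code. The following is a minimal, hypothetical illustration using the Hugging Face Transformers API; the model name, training samples, learning rate, and loss threshold are placeholders rather than the paper's actual setup.

```python
# Hypothetical sketch of hyperfitting: fine-tune a pre-trained causal LM to
# near-zero training loss on a tiny dataset, then generate with greedy decoding.
# Model, data, and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any pre-trained LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# A very small set of training samples (purely illustrative strings).
samples = [
    "Once upon a time, a small model learned a small dataset by heart.",
    "The quick brown fox jumps over the lazy dog again and again.",
]
batch = tokenizer(samples, return_tensors="pt", padding=True)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for step in range(200):  # keep training until the loss is near zero
    optimizer.zero_grad()
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=labels)
    out.loss.backward()
    optimizer.step()
    if out.loss.item() < 1e-3:
        break

# Greedy decoding: pick the top-1 token at every step (do_sample=False),
# the setting in which base models typically degenerate into repetition.
model.eval()
prompt = tokenizer("The history of the Nordic languages", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**prompt, max_new_tokens=100,
                                do_sample=False,
                                pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Greedy decoding here corresponds to the top-1 selection discussed in the paper; Top-P (nucleus) sampling would instead sample from the smallest set of tokens whose cumulative probability exceeds a threshold.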
GPT-SW3: An Autoregressive Language Model for the Nordic Languages
Ekgren, Ariel, Gyllensten, Amaru Cuba, Stollenwerk, Felix, Öhman, Joey, Isbister, Tim, Gogoulou, Evangelia, Carlsson, Fredrik, Heiman, Alice, Casademont, Judit, Sahlgren, Magnus
There is a growing interest in building and applying Large Language Models (LLMs) for languages other than English. This interest has been fuelled partly by the unprecedented popularity of ChatGPT […] We have faced all of these challenges in our work on developing the first native LLM for the Nordic (or, more accurately, North Germanic) languages. The LLM, which we call GPT-SW3, is a continuation of our previous Swedish-only model (Ekgren et al., 2022), and is a collection […]
The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling
Öhman, Joey, Verlinden, Severine, Ekgren, Ariel, Gyllensten, Amaru Cuba, Isbister, Tim, Gogoulou, Evangelia, Carlsson, Fredrik, Sahlgren, Magnus
Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of the LLMs typically correlates with the scale and quality of the datasets. This means that it may be challenging to build LLMs for smaller languages such as the Nordic ones, where the availability of text corpora is limited. In order to facilitate the development of LLMs in the Nordic languages, we curate a high-quality dataset consisting of 1.2TB of text in all of the major North Germanic languages (Danish, Icelandic, Norwegian, and Swedish), as well as some high-quality English data. This paper details our considerations and processes for collecting, cleaning, and filtering the dataset.
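To give a concrete sense of what document-level cleaning and filtering involves, the sketch below shows hypothetical heuristics (whitespace normalization, length and symbol-ratio filters, exact deduplication); the specific thresholds and rules are illustrative assumptions, not The Nordic Pile's actual pipeline.

```python
# Hypothetical sketch of document-level corpus cleaning and filtering.
# Thresholds and rules are illustrative placeholders.
import hashlib
import re


def clean(text: str) -> str:
    """Strip control characters and normalize whitespace."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    """Length and character-noise filters of the kind used in corpus curation."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(not (ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-")
                  for ch in text)
    return symbols / max(len(text), 1) <= max_symbol_ratio


def deduplicate(docs):
    """Exact deduplication via content hashing."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


raw_documents = ["..."]  # raw text from web crawls, books, public datasets, etc.
cleaned = [clean(d) for d in raw_documents]
curated = deduplicate([d for d in cleaned if keep(d)])
```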