The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

Carlsson, Fredrik, Liu, Fangyu, Ward, Daniel, Kurfali, Murathan, Nivre, Joakim

Dec-5-2024, arXiv.org Artificial Intelligence

This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well-documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating using greedy decoding. This issue persists even with state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples (a process we refer to as hyperfitting), the long-sequence generative capabilities are greatly enhanced. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find this phenomenon to be distinctly different from that of grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.

Despite the recent rapid advancements in artificial intelligence spearheaded by Transformer-based large language models (LLMs) and their emergent phenomena (Wei et al., 2022b; Bubeck et al., 2023), models trained on next-token pre-training objectives often degenerate when producing longer texts. This is particularly true for greedy decoding, and has resulted in mitigation strategies such as repetition penalties (Keskar et al., 2019) and nucleus sampling (Holtzman et al., 2020). However, when these heuristics are removed and the top-1 candidate is simply picked at each time-step, LLMs display strong tendencies to repeat themselves at the token, phrase, and sentence level (Holtzman et al., 2020), as is exemplified in Figure 1. This is a recurrent phenomenon for which there are many proposed hypotheses but, to the best of our knowledge, no definitive explanation exists.

[Figure caption, partial: color indicates how repetitive the generated text is.]

Although hyperfitted models achieve significantly worse validation loss, they produce texts that align markedly better with human preferences and automatic diversity metrics. Indeed, we find that hyperfitting state-of-the-art LLMs yields capabilities that outperform models with 10x the number of parameters.
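The decoding terms used above (greedy decoding, Top-P or nucleus sampling, low-entropy predictions) can be illustrated with a minimal Python sketch. It is not the authors' code and does not perform hyperfitting; the function names and the two next-token distributions `flat` and `sharp` are hypothetical, chosen only to contrast a spread-out prediction with one that puts nearly all probability on a single token.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def greedy_pick(probs):
    """Greedy decoding step: index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_p_filter(probs, top_p=0.9):
    """Nucleus (Top-P) filtering.

    Keeps the smallest set of most-probable tokens whose cumulative
    probability reaches top_p, zeroes the rest, and renormalizes.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


# Hypothetical next-token distributions over a 4-token vocabulary.
flat = [0.4, 0.3, 0.2, 0.1]             # probability spread over several tokens
sharp = [0.997, 0.001, 0.001, 0.001]    # nearly all mass on a single token

if __name__ == "__main__":
    for name, dist in (("flat", flat), ("sharp", sharp)):
        print(name, "entropy:", entropy(dist))
        print(name, "greedy token:", greedy_pick(dist))
        print(name, "nucleus (top_p=0.8):", top_p_filter(dist, 0.8))
```

For a distribution as sharp as the second one, the nucleus contains only the top token, so Top-P sampling and greedy decoding select the same token at that step.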

Country:
- Africa (0.04)
- North America
  - Jamaica (0.04)
  - United States
    - Kentucky (0.04)
    - Virginia (0.04)
    - Louisiana (0.04)
    - Delaware (0.04)
- Europe
  - Germany (0.04)
  - France (0.04)
  - United Kingdom
    - Scotland (0.04)
    - England > Oxfordshire
      - Oxford (0.04)
  - Sweden > Uppsala County
    - Uppsala (0.04)
- Asia
  - China (0.04)
  - British Indian Ocean Territory > Diego Garcia (0.04)
  - Middle East > Saudi Arabia
    - Asir Province > Abha (0.04)

Genre:
- Research Report > Experimental Study (0.67)

Industry:
- Law (1.00)
- Leisure & Entertainment > Sports (0.67)
- Health & Medicine > Therapeutic Area
  - Infections and Infectious Diseases (1.00)
  - Immunology (1.00)
- Government > Regional Government
  - North America Government > United States Government (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)