Autocorrelations Decay in Texts and Applicability Limits of Language Models
Mikhaylovskiy, Nikolay, Churilov, Ilya
arXiv.org Artificial Intelligence
To avoid any terminological doubt: when we write "models of the language", we mean any models that explain some linguistic phenomena, while "language models" refers to probabilistic language models as defined in Subsection 2.3, Probabilistic Language Models. Not long ago, probabilistic language models were just models that assign probabilities to sequences of words [4]. Now they are the cornerstone of any task in computational linguistics, through few-shot learning [6], prompt engineering [38], or fine-tuning [13]. On the other hand, current language models fail to capture long-range dependencies in text consistently. For example, text generation with a maximum-likelihood objective leads to rapid text degeneration, and consistent text generation requires probabilistic sampling and other tricks [22]. Large language models such as GPT-3 [6] push the boundary of "short text" rather far (specifically, to 2048 tokens), but do not remove the problem. Our contributions in this work are the following:
- We explain how the laws of autocorrelation decay in texts are related to the applicability of language models to long texts.
- We pioneer the use of pretrained word vectors for autocorrelation computations, which allows us to study a very wide range of autocorrelation distances.
- We show that the autocorrelations in literary texts decay according to power laws for all these distances.
- We show that distributional semantics typically provides coherent autocorrelation decay exponents for texts translated into multiple languages, unlike earlier flawed approaches.
- We show that the behavior of autocorrelation decay in generated texts is quantitatively, and often qualitatively, different from that in literary texts.
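To make the idea of measuring autocorrelation decay with word vectors concrete, here is a minimal Python sketch. It is not the authors' implementation: this summary does not give their exact definition, so the sketch assumes the autocorrelation at a given lag is the mean cosine similarity between word vectors that many positions apart, and it estimates a power-law exponent with a plain least-squares fit in log-log space. The function names `autocorrelation` and `fit_power_law` are hypothetical, and the synthetic alternating vectors stand in for real pretrained embeddings.

```python
import numpy as np


def autocorrelation(vectors, lag):
    """Mean cosine similarity between vectors `lag` positions apart.

    Assumption: this cosine-based definition is illustrative only and may
    differ from the definition used in the paper.
    """
    v = np.asarray(vectors, dtype=float)
    if not 0 < lag < len(v):
        raise ValueError("lag must be between 1 and len(vectors) - 1")
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    # Leave zero vectors as zeros instead of dividing by zero.
    unit = v / np.where(norms == 0, 1.0, norms)
    return float(np.mean(np.sum(unit[:-lag] * unit[lag:], axis=1)))


def fit_power_law(lags, values):
    """Fit values ~ c * lag**(-alpha) by linear regression in log-log space.

    Only strictly positive values are used, since the logarithm is
    undefined otherwise. Returns (alpha, c).
    """
    lags = np.asarray(lags, dtype=float)
    values = np.asarray(values, dtype=float)
    mask = values > 0
    slope, intercept = np.polyfit(np.log(lags[mask]), np.log(values[mask]), 1)
    return float(-slope), float(np.exp(intercept))


# Toy stand-in for a text: word vectors alternating between two
# orthogonal directions (real use would look up pretrained embeddings).
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0]] * 4)
lag1 = autocorrelation(toy_vectors, 1)  # neighbours are orthogonal
lag2 = autocorrelation(toy_vectors, 2)  # every second vector repeats

# Exact power-law data, used to check that the fit recovers the exponent.
example_lags = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
alpha, c = fit_power_law(example_lags, 3.0 * example_lags ** -0.5)
```

Under this reading, a power-law decay appears as a straight line on a log-log plot of autocorrelation against lag, whereas an exponential decay would curve downward. Comparing the fitted `alpha` across texts is one way to see the kind of difference between literary and generated text that the abstract describes.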
May-11-2023