Autocorrelations Decay in Texts and Applicability Limits of Language Models

Nikolay Mikhaylovskiy, Ilya Churilov

arXiv.org Artificial Intelligence 

To avoid any terminological doubt: when we write "models of the language", we refer to any models that explain some linguistic phenomena, while "language models" refers to probabilistic language models as defined in Subsection 2.3, Probabilistic Language Models. While not long ago probabilistic language models were just models that assign probabilities to sequences of words [4], they are now the cornerstone of every task in computational linguistics, through few-shot learning [6], prompt engineering [38], or fine-tuning [13]. On the other hand, current language models consistently fail to capture long-range dependencies in text. For example, text generation with a maximum-likelihood objective leads to rapid text degeneration, and coherent text generation requires probabilistic sampling and other tricks [22]. Large language models such as GPT-3 [6] push the boundary of "short text" rather far (specifically, to 2048 tokens), but do not remove the problem. Our contributions in this work are the following:

- We explain how the laws of autocorrelation decay in texts are related to the applicability of language models to long texts;
- We pioneer the use of pretrained word vectors for autocorrelation computations, which allows us to study the widest range of autocorrelation distances;
- We show that autocorrelations in literary texts decay according to power laws at all these distances;
- We show that distributional semantics typically provides coherent autocorrelation decay exponents for texts translated into multiple languages, unlike earlier flawed approaches;
- We show that the behavior of autocorrelation decay in generated texts differs quantitatively, and often qualitatively, from that in literary texts.
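The word-vector autocorrelation idea can be illustrated with a minimal sketch. The exact estimator used in the paper is not reproduced here; the sketch below makes the common assumption that the autocorrelation at lag d is the mean inner product between mean-centered word vectors d positions apart, with the `vectors` array (one pretrained embedding per token, in text order) supplied by the caller.

```python
import numpy as np

def autocorrelation(vectors, d):
    """Mean inner product between mean-centered word vectors at lag d.

    vectors: (n_tokens, dim) array of word embeddings in text order.
    d: lag (distance in tokens), 1 <= d < n_tokens.
    """
    v = vectors - vectors.mean(axis=0)          # center each dimension
    return float(np.mean(np.sum(v[:-d] * v[d:], axis=1)))

# Toy example: a strictly alternating two-word "text" is anti-correlated
# at lag 1 and positively correlated at lag 2.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(autocorrelation(toy, 1))  # -0.5
print(autocorrelation(toy, 2))  # 0.5
```

Fitting a line to log C(d) versus log d over the lags where C(d) > 0 then gives the power-law decay exponent discussed in the contributions above.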
