Entropy and type-token ratio in gigaword corpora
Rosillo-Rodes, Pablo; San Miguel, Maxi; Sanchez, David
arXiv.org Artificial Intelligence
Lexical diversity measures the vocabulary variation in texts. While its utility is evident for analyses of language change and in applied linguistics, it is not yet clear how to operationalize this concept in a unique way. Here we investigate entropy and type-token ratio, two widely employed metrics of lexical diversity, in six massive linguistic datasets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in register and genre, thus constituting a diverse testbed for a quantitative approach to lexical diversity. Strikingly, we find a functional relation between entropy and type-token ratio that holds across the corpora under consideration. Further, in the limit of large vocabularies, we find an analytical expression that sheds light on the origin of this relation and its connection with both Zipf's and Heaps' laws. Our results thus contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
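For readers unfamiliar with the two metrics, the following is a minimal sketch of how they are commonly computed from a token list. It is an illustration under stated assumptions, not the authors' pipeline: the whitespace tokenization, the lowercasing, the base-2 logarithm, and the function name `lexical_diversity` are all choices made here for the example.

```python
import math
from collections import Counter


def lexical_diversity(tokens):
    """Return (entropy, type_token_ratio) for a list of word tokens.

    Entropy is the Shannon entropy of the empirical word-frequency
    distribution, in bits (base 2 is an assumption of this sketch).
    Type-token ratio is the number of distinct words divided by the
    total number of words.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")

    counts = Counter(tokens)      # frequency of each word type
    n_tokens = len(tokens)        # total number of word tokens
    n_types = len(counts)         # number of distinct word types

    ttr = n_types / n_tokens
    entropy = -sum(
        (c / n_tokens) * math.log2(c / n_tokens) for c in counts.values()
    )
    return entropy, ttr


# Toy usage: naive whitespace tokenization (an assumption of this sketch).
tokens = "the cat sat on the mat".lower().split()
entropy, ttr = lexical_diversity(tokens)
print(f"entropy = {entropy:.4f} bits, TTR = {ttr:.4f}")
```

On a real corpus, both values depend strongly on text length and on how tokens are defined, so the tokenization step would need to match whatever preprocessing the study uses.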
Nov-15-2024
- Country:
- Asia > Middle East
- Republic of Türkiye (0.04)
- Europe
- Denmark > Capital Region
- Copenhagen (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Middle East > Cyprus
- Netherlands > North Holland
- Amsterdam (0.04)
- Spain > Balearic Islands
- Switzerland > Geneva
- Geneva (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- North America
- Bermuda (0.04)
- Canada > Rocky Mountains (0.04)
- Greenland (0.04)
- United States
- Illinois > Cook County
- Chicago (0.04)
- Minnesota (0.04)
- Rocky Mountains (0.04)
- South America > Colombia
- Meta Department > Villavicencio (0.04)
- Genre:
- Research Report > New Finding (0.88)
- Technology: