Correlation Dimension of Natural Language in a Statistical Manifold
arXiv.org Artificial Intelligence
The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits multifractal structure, with global self-similarity and a universal dimension of around 6.5, which is smaller than those of simple discrete random sequences and larger than that of a Barabási-Albert process. Long memory is the key to producing self-similarity. Our method is applicable to any probabilistic model of real-world discrete sequences, and we show an application to music data.
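The idea in the abstract can be illustrated with a minimal sketch. It assumes each point in the sequence is a categorical distribution (for instance, a language model's next-token probabilities), for which the Fisher-Rao distance has the closed form d(p, q) = 2 arccos(sum_i sqrt(p_i q_i)). The Grassberger-Procaccia estimate is then the slope of log C(eps) against log eps, where C(eps) is the fraction of point pairs closer than eps. The function names, the choice of eps grid, and the single least-squares fit over all scales are assumptions made for illustration; the paper's actual model, scaling range, and fitting procedure are not given in this abstract.

```python
import numpy as np

def fisher_rao_distance(p, q):
    """Fisher-Rao distance between categorical distributions (last axis sums to 1)."""
    # Bhattacharyya coefficient; clipped to guard against rounding just above 1.
    bc = np.sum(np.sqrt(p * q), axis=-1)
    return 2.0 * np.arccos(np.clip(bc, 0.0, 1.0))

def correlation_dimension(probs, eps_values):
    """Slope of log C(eps) vs. log eps over all unordered pairs of rows in probs.

    probs: array of shape (n_points, n_categories), hypothetical model outputs.
    eps_values: radii to evaluate; an assumed grid, not the paper's.
    """
    probs = np.asarray(probs, dtype=float)
    eps_values = np.asarray(eps_values, dtype=float)
    i, j = np.triu_indices(len(probs), k=1)           # all pairs with i < j
    dists = fisher_rao_distance(probs[i], probs[j])
    # Correlation integral: fraction of pairs within each radius.
    c = np.array([np.mean(dists < eps) for eps in eps_values])
    mask = c > 0                                       # log is undefined at C = 0
    slope, _ = np.polyfit(np.log(eps_values[mask]), np.log(c[mask]), 1)
    return slope
```

Calling `correlation_dimension(probs, np.logspace(-1, 0, 10))` on an array of probability vectors returns a single slope. In practice the estimate depends strongly on which range of eps is treated as the scaling region, so a fit over the whole grid should be read only as a rough indication.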
May-15-2024
- Country:
- Asia > Middle East
- Jordan (0.04)
- North America > United States
- California > Santa Clara County > Palo Alto (0.04)
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Leisure & Entertainment (0.46)
- Media > Music (0.46)
- Technology: