Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs
Haoyang Chen, Kumiko Tanaka-Ishii
arXiv.org Artificial Intelligence
We present a comparative analysis of text complexity across domains using scale-free metrics. We quantify linguistic complexity via Heaps' exponent $β$ (vocabulary growth), Taylor's exponent $α$ (word-frequency fluctuation scaling), compression rate $r$ (redundancy), and entropy. Our corpora span three domains: legal documents (statutes, cases, deeds) as a specialized domain, general natural language texts (literature, Wikipedia), and AI-generated (GPT) text. We find that legal texts exhibit slower vocabulary growth (lower $β$) and higher term consistency (higher $α$) than general texts. Within the legal domain, statutory codes have the lowest $β$ and highest $α$, reflecting strict drafting conventions, while cases and deeds show higher $β$ and lower $α$. In contrast, GPT-generated text shows statistics that align more closely with general language patterns. These results demonstrate that legal texts exhibit domain-specific structures and complexities, which current generative models do not fully replicate.
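Heaps' exponent $β$ mentioned above is typically estimated from the empirical vocabulary-growth curve $V(n) \sim K n^{β}$, where $V(n)$ is the number of distinct word types after $n$ tokens. The abstract does not give the authors' estimation procedure; the following is a minimal sketch of one common approach, a least-squares fit on log-log data, using only the Python standard library:

```python
import math

def heaps_exponent(tokens):
    """Estimate Heaps' exponent beta from a token sequence.

    Heaps' law: V(n) ~ K * n^beta, with V(n) the vocabulary size
    after n tokens. Beta is fit by ordinary least squares on the
    (log n, log V(n)) points. This is an illustrative method, not
    necessarily the one used in the paper.
    """
    seen = set()
    xs, ys = [], []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        xs.append(math.log(i))
        ys.append(math.log(len(seen)))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Slope of the log-log regression line = Heaps' exponent beta.
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    return beta
```

As sanity checks: a text where every token is new gives $β = 1$ (vocabulary grows linearly), while a text repeating a single token gives $β = 0$; natural-language corpora typically fall in between, and per the abstract, legal texts sit lower than general texts.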
Sep-23-2025