Analyzing Cognitive Plausibility of Subword Tokenization
arXiv.org Artificial Intelligence
Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze how the tokenizer output correlates with human response times and accuracy in a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and worse coverage of derivational morphemes, in contrast with prior work.
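As a rough illustration of the kind of analysis the abstract describes, the sketch below correlates a simple tokenizer statistic with per-word response times. It is not the paper's method: the greedy longest-match splitter, the tiny vocabulary, the choice of "number of subword pieces" as the statistic, and the response times are all hypothetical placeholders chosen for this example.

```python
from math import sqrt


def greedy_tokenize(word, vocab):
    """Split a word by greedy longest-prefix match against vocab.

    Falls back to single characters when no vocabulary entry matches.
    This is a stand-in splitter for illustration, not a trained tokenizer.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking towards one character.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical subword vocabulary (assumption, not from the paper).
vocab = {"un", "happi", "ness", "cat", "walk", "ing", "table"}

# Invented response times in milliseconds, for illustration only.
response_times_ms = {
    "cat": 520.0,
    "table": 540.0,
    "walking": 590.0,
    "unhappiness": 700.0,
}

words = list(response_times_ms)
piece_counts = [len(greedy_tokenize(w, vocab)) for w in words]
times = [response_times_ms[w] for w in words]

# Correlation between how many pieces a word is split into and how long
# participants (hypothetically) took to respond to it.
r = pearson(piece_counts, times)
print(f"pieces per word: {dict(zip(words, piece_counts))}")
print(f"Pearson r between piece count and response time: {r:.3f}")
```

In a real study the split counts would come from trained tokenizers at each vocabulary size and the response times from a lexical decision dataset; the same correlation could then be compared across algorithms and languages.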
Oct-20-2023
- Country:
  - Oceania > Australia
  - North America
    - Dominican Republic (0.04)
    - United States
      - Maryland > Baltimore (0.04)
      - Washington > King County > Seattle (0.04)
  - Europe
    - Netherlands > North Holland > Amsterdam (0.04)
    - Ireland > Leinster > County Dublin > Dublin (0.04)
    - Germany
    - Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
  - Asia
    - China (0.04)
    - Middle East > Israel > Southern District > Beer-Sheva (0.04)
  - Africa > Kenya > Mandera County > Mandera (0.04)
- Genre:
  - Research Report
    - New Finding (0.88)
    - Experimental Study (0.68)
- Technology: