A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages
Arnett, Catherine, Chang, Tyler A., Bergen, Benjamin K.
arXiv.org Artificial Intelligence
How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use linear regressions to estimate byte premiums for other languages. We release a tool to obtain byte premiums for any two languages, enabling comparisons of dataset sizes across languages for more equitable multilingual model development and data practices.
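The definition above can be illustrated with a minimal sketch. It assumes the byte premium is computed as a plain ratio of UTF-8 byte counts over one content-matched text pair. The function names `utf8_byte_count` and `byte_premium` and the example sentence pair are hypothetical illustrations; this is not the authors' released tool and does not use their data.

```python
def utf8_byte_count(text: str) -> int:
    """Return the number of bytes needed to encode `text` as UTF-8."""
    return len(text.encode("utf-8"))


def byte_premium(text_a: str, text_b: str) -> float:
    """Ratio of UTF-8 bytes for text_a to UTF-8 bytes for text_b.

    Both arguments are assumed to be content-matched (parallel) text.
    A value above 1.0 means language A needs more bytes than language B
    to express the same content.
    """
    bytes_b = utf8_byte_count(text_b)
    if bytes_b == 0:
        raise ValueError("text_b must not be empty")
    return utf8_byte_count(text_a) / bytes_b


# Illustrative parallel pair (an assumption for this sketch, not paper data).
english = "Hello, world!"   # ASCII: 1 byte per character, 13 bytes
russian = "Привет, мир!"    # Cyrillic letters take 2 bytes each, 21 bytes

premium_ru_en = byte_premium(russian, english)
print(f"Byte premium (Russian relative to English): {premium_ru_en:.3f}")
```

A single sentence pair gives only a rough figure; a usable estimate would aggregate byte counts over a large parallel corpus before taking the ratio.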
Mar-1-2024
- Country:
  - Asia
    - Middle East > UAE (0.04)
    - Singapore (0.04)
  - Europe
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - Denmark > Capital Region
      - Copenhagen (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.04)
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - United States > California
      - San Diego County > San Diego (0.04)
- Genre:
  - Research Report (0.64)
- Technology: