Goldfish: Monolingual Language Models for 350 Languages
Chang, Tyler A., Arnett, Catherine, Tu, Zhuowen, Bergen, Benjamin K.
arXiv.org Artificial Intelligence
For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. However, using FLORES perplexity as a metric, we find that these models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B). To facilitate research that focuses on low-resource languages, we pre-train and release Goldfish, a suite of monolingual autoregressive Transformer language models up to 125M parameters for 350 languages. The Goldfish reach lower FLORES perplexities than BLOOM, XGLM, and MaLA-500 on 98 of 204 FLORES languages, despite each Goldfish model being over 10x smaller. However, the Goldfish significantly underperform larger multilingual models on reasoning benchmarks, suggesting that for low-resource languages, multilinguality primarily improves general reasoning abilities rather than basic text generation. We release models trained on 5MB (350 languages), 10MB (288 languages), 100MB (166 languages), and 1GB (83 languages) of text data where available. The Goldfish models are available as baselines, fine-tuning sources, or augmentations to existing models in low-resource NLP research, and they are further useful for crosslinguistic studies requiring maximally comparable models across languages.
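The bigram comparison mentioned above can be illustrated with a minimal sketch. The abstract does not say how the bigram baseline was tokenized or smoothed, so the whitespace tokenization, the add-one smoothing, and the function names `train_bigram` and `perplexity` are all assumptions made for illustration, not the authors' setup.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace tokens.

    Assumption: whitespace tokenization; the paper's tokenization is not
    stated in the abstract.
    """
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = {"</s>"}
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def perplexity(sentences, model):
    """Per-token perplexity under an add-one smoothed bigram model.

    Assumption: add-one (Laplace) smoothing, with one extra vocabulary
    slot shared by all tokens never seen in training.
    """
    context_counts, bigram_counts, vocab = model
    v = len(vocab) + 1  # +1 for the shared unseen-token slot
    log_prob = 0.0
    n_tokens = 0
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


if __name__ == "__main__":
    model = train_bigram(["the cat sat", "the dog sat"])
    print(perplexity(["the cat sat"], model))      # text seen in training
    print(perplexity(["sat the cat cat"], model))  # unfamiliar word order
```

A neural model "performs worse than bigrams" in this sense when its perplexity on the same evaluation text is higher than the value such a baseline reports; the two numbers are comparable only if they are normalized over the same units.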
Aug-19-2024