LLMs for Extremely Low-Resource Finno-Ugric Languages
Purason, Taido; Kuulmets, Hele-Andra; Fishel, Mark
arXiv.org Artificial Intelligence
The advancement of large language models (LLMs) has predominantly focused on high-resource languages, leaving low-resource languages, such as those in the Finno-Ugric family, significantly underrepresented. This paper addresses this gap by focusing on Võro, Livonian, and Komi. We cover almost the entire cycle of LLM creation, from data collection to instruction tuning and evaluation. Our contributions include developing multilingual base and instruction-tuned models; creating evaluation benchmarks, including the smugri-MT-bench multi-turn conversational benchmark; and conducting human evaluation. We intend for this work to promote linguistic diversity, ensuring that lesser-resourced languages can benefit from advancements in NLP.
Oct-24-2024
- Country:
- North America
- United States > Pennsylvania (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- Europe
- Monaco (0.04)
- Middle East > Malta
- Eastern Region > Northern Harbour District > St. Julian's (0.04)
- Faroe Islands > Streymoy
- Tórshavn (0.04)
- Estonia > Tartu County
- Tartu (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Atlantic Ocean > North Atlantic Ocean
- Baltic Sea (0.04)
- Genre:
- Research Report (1.00)
- Industry:
- Education (0.46)
- Technology: