Dialectal and Low-Resource Machine Translation for Aromanian
Jerpelea, Alexandru-Iulius; Rădoi, Alina; Nisioi, Sergiu
arXiv.org Artificial Intelligence
This paper presents the process of building a neural machine translation system that supports English, Romanian, and Aromanian, an endangered Eastern Romance language. The contribution is twofold: (1) the creation of the most extensive Aromanian-Romanian parallel corpus to date, consisting of 79,000 sentence pairs, and (2) the development and comparative analysis of several machine translation models optimized for Aromanian. To accomplish this, we introduce a suite of auxiliary tools, including a language-agnostic sentence embedding model for text mining and automated evaluation, complemented by a diacritics conversion system for converting between different writing standards. This research contributes to both computational linguistics and language preservation by establishing essential resources for a historically under-resourced language. All datasets, trained models, and associated tools are publicly available at https://huggingface.co/aronlp and https://arotranslate.com
Jan-7-2025