Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models
Ebrahimi, Abteen; McCarthy, Arya D.; Oncevay, Arturo; Chiruzzo, Luis; Ortega, John E.; Giménez-Lugo, Gustavo A.; Coto-Solano, Rolando; Kann, Katharina
arXiv.org Artificial Intelligence
Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri–Spanish, Guarani–Spanish, Quechua–Spanish, and Shipibo-Konibo–Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain competitive with each other.
Feb-15-2023
- Genre:
- Research Report (0.82)
- Technology: