Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

Ebrahimi, Abteen, McCarthy, Arya D., Oncevay, Arturo, Chiruzzo, Luis, Ortega, John E., Giménez-Lugo, Gustavo A., Coto-Solano, Rolando, Kann, Katharina

Feb-15-2023, arXiv.org Artificial Intelligence

Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri–Spanish, Guarani–Spanish, Quechua–Spanish, and Shipibo-Konibo–Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain competitive with each other.

artificial intelligence, computational linguistic, natural language

Country:
- South America
  - Peru (0.04)
  - Uruguay (0.04)
  - Brazil (0.04)
  - Bolivia (0.04)
- North America
  - Dominican Republic (0.04)
  - Costa Rica (0.04)
  - United States
    - Michigan (0.04)
    - California (0.04)
    - Pennsylvania > Allegheny County
      - Pittsburgh (0.04)
    - Ohio > Franklin County
      - Columbus (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Massachusetts > Suffolk County
      - Boston (0.04)
    - Georgia > Fulton County
      - Atlanta (0.04)
    - Colorado > Boulder County
      - Boulder (0.04)
- Europe
  - Germany > Berlin (0.04)
  - Switzerland > Geneva
    - Geneva (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Taiwan > Taiwan Province
    - Taipei (0.04)
- Africa > Middle East
  - Egypt > Giza Governorate > Giza (0.06)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence > Natural Language
  - Text Processing (1.00)
  - Machine Translation (1.00)
