Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation

Jones, Alex, Caswell, Isaac, Saxena, Ishank, Firat, Orhan

Mar-27-2023–arXiv.org Artificial Intelligence

Neural machine translation (NMT) has progressed rapidly over the past several years, and modern models are able to achieve relatively high quality using only monolingual text data, an approach dubbed Unsupervised Machine Translation (UNMT). We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements, and can be combined for even greater improvements; (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones, especially with larger models; and (4) we compare the efficacy of multilingual lexicon data versus human-translated parallel data.

Neural machine translation (NMT) has emerged as the dominant way of training machine translation models (Bahdanau ...
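As a rough illustration of what lexical data augmentation can look like, the sketch below swaps words in a monolingual sentence for their translations from a bilingual lexicon (often called codeswitching). This is an assumed, generic form of the idea and not necessarily one of the augmentation families the authors compare; the function name `codeswitch`, the `swap_prob` parameter, and the toy English-to-Spanish lexicon are all hypothetical.

```python
import random


def codeswitch(sentence, lexicon, swap_prob=0.5, rng=None):
    """Replace tokens found in a bilingual lexicon with a translation.

    Hypothetical helper for illustration only: each whitespace token that
    has a lexicon entry is swapped with probability `swap_prob`.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    out = []
    for token in sentence.split():
        translations = lexicon.get(token.lower())
        if translations and rng.random() < swap_prob:
            # Pick one of the listed translations at random.
            out.append(rng.choice(translations))
        else:
            out.append(token)
    return " ".join(out)


# Toy lexicon; the entries are illustrative, not taken from the paper.
toy_lexicon = {"cat": ["gato"], "house": ["casa"], "water": ["agua"]}

source = "The cat drinks water in the house"
augmented = codeswitch(source, toy_lexicon, swap_prob=1.0)
unchanged = codeswitch(source, toy_lexicon, swap_prob=0.0)
```

With `swap_prob=1.0` every token that has a lexicon entry is replaced, and with `swap_prob=0.0` the sentence is returned unchanged. A realistic pipeline would also need proper tokenization and some handling of inflection, both of which this sketch ignores.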

artificial intelligence, natural language, translation

Country:
- Oceania (0.04)
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America
  - United States
    - Washington > King County
      - Seattle (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - California > San Diego County
      - San Diego (0.04)
  - Canada > Ontario
    - National Capital Region > Ottawa (0.04)
- Europe
  - Germany > Berlin (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Iceland > Capital Region
    - Reykjavik (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - India (0.04)
  - Thailand > Pattani
    - Pattani (0.04)
  - Philippines > Luzon
    - Ilocos Region > Province of Pangasinan (0.04)
  - Middle East
    - Republic of Türkiye (0.04)
    - Israel (0.04)
  - China > Beijing
    - Beijing (0.04)
- Africa
  - Niger (0.05)
  - Uganda (0.04)

Genre:
- Research Report > New Finding (0.92)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
