Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets
Tiedemann, Jörg; Nakov, Preslav
arXiv.org Artificial Intelligence
This paper provides an analysis of character-level machine translation models used in pivot-based translation when applied to sparse and noisy datasets, such as crowdsourced movie subtitles. In our experiments, we find that such character-level models cut the number of untranslated words by over 40% and are especially competitive (improvements of 2-3 BLEU points) in the case of limited training data. We explore the impact of character alignment, phrase table filtering, bitext size and the choice of pivot language on translation quality. We further compare cascaded translation models to the use of synthetic training data via multiple pivots, and we find that the latter works significantly better. Finally, we demonstrate that neither word- nor character-BLEU correlates perfectly with human judgments, due to BLEU's sensitivity to length.
Sep-27-2021
- Country:
  - Africa > Middle East
    - Egypt > Giza Governorate > Giza (0.04)
  - Asia
    - China > Beijing
      - Beijing (0.04)
    - India > Maharashtra
      - Mumbai (0.04)
    - Japan > Hokkaidō
      - Hokkaidō Prefecture > Sapporo (0.04)
    - Middle East
      - Qatar > Ad-Dawhah
        - Doha (0.04)
      - Republic of Türkiye > Istanbul Province
        - Istanbul (0.04)
    - Singapore (0.04)
    - South Korea (0.04)
  - Europe
    - Czechia > Prague (0.05)
    - United Kingdom > Scotland
      - City of Edinburgh > Edinburgh (0.04)
    - Middle East > Republic of Türkiye
      - Istanbul Province > Istanbul (0.04)
    - Belgium > Flanders
      - Flemish Brabant > Leuven (0.04)
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
    - France (0.04)
    - Italy > Liguria
      - Genoa (0.04)
    - Sweden > Uppsala County
      - Uppsala (0.04)
    - Bulgaria (0.04)
    - Netherlands > North Holland
      - Amsterdam (0.04)
  - North America
    - Canada
      - Ontario > National Capital Region
        - Ottawa (0.04)
      - Quebec > Montreal (0.04)
    - United States
      - Florida > Orange County
        - Orlando (0.04)
      - Hawaii > Honolulu County
        - Honolulu (0.04)
      - Massachusetts > Middlesex County
        - Cambridge (0.04)
      - New York > Monroe County
        - Rochester (0.04)
      - Pennsylvania > Philadelphia County
        - Philadelphia (0.04)
      - Washington > King County
        - Seattle (0.04)
- Genre:
  - Research Report (0.70)
- Technology: