Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting
Emmy Liu, Aditi Chaudhary, Graham Neubig
arXiv.org Artificial Intelligence
Idioms are common in everyday language, but often pose a challenge to translators because their meanings do not follow from the meanings of their parts. Despite significant advances, machine translation systems still struggle to translate idiomatic expressions. We provide a simple characterization of idiomatic translation and related issues. This allows us to conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations. To expand multilingual resources, we compile a dataset of ~4k natural sentences containing idiomatic expressions in French, Finnish, and Japanese. To improve translation of natural idioms, we introduce two straightforward yet effective techniques: the strategic upweighting of training loss on potentially idiomatic sentences, and using retrieval-augmented models. This not only improves the accuracy of a strong pretrained MT model on idiomatic sentences by up to 13% in absolute accuracy, but also holds potential benefits for non-idiomatic sentences.
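The loss upweighting mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme, so everything here is an assumption made for illustration: the function name `upweighted_loss`, the `idiom_weight` parameter and its default value, and the choice to normalize by the total weight. It only shows the general idea that sentences flagged as potentially idiomatic contribute more to the training loss than other sentences.

```python
from typing import Sequence


def upweighted_loss(
    per_sentence_losses: Sequence[float],
    is_potentially_idiomatic: Sequence[bool],
    idiom_weight: float = 2.0,  # hypothetical value, not taken from the paper
) -> float:
    """Weighted mean of per-sentence losses.

    Sentences flagged as potentially idiomatic get weight `idiom_weight`;
    all other sentences get weight 1.0. Normalizing by the total weight is
    an assumption of this sketch, not a detail stated in the abstract.
    """
    if len(per_sentence_losses) != len(is_potentially_idiomatic):
        raise ValueError("losses and flags must have the same length")

    weights = [idiom_weight if flag else 1.0 for flag in is_potentially_idiomatic]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0

    weighted_sum = sum(w * loss for w, loss in zip(weights, per_sentence_losses))
    return weighted_sum / total_weight
```

With `idiom_weight=1.0` the function reduces to a plain mean, so the weight acts as a single knob for how strongly potentially idiomatic sentences are emphasized during training.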
Oct-20-2023
- Genre:
- Research Report > Experimental Study (0.68)
- Technology: