Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization
Kaden Uhlig, Joern Wuebker, Raphael Reinauer, John DeNero
arXiv.org Artificial Intelligence
Reinforcement Learning from Human Feedback (RLHF) and derivative techniques like Direct Preference Optimization (DPO) are task-alignment algorithms used to repurpose general, foundational models for specific tasks. We show that applying task-alignment to neural machine translation (NMT) addresses an existing task-data mismatch in NMT, leading to improvements across all languages of a multilingual model, even when task-alignment is only applied to a subset of those languages. We do so by introducing Direct Quality Optimization (DQO), a variant of DPO leveraging a pre-trained translation quality estimation model as a proxy for human preferences, and verify the improvements with both automatic metrics and human evaluation.
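To make the idea in the abstract concrete, the following is a minimal sketch of how a quality estimation (QE) score could stand in for a human preference label inside the standard DPO objective. The abstract does not spell out the DQO procedure, so the pair-selection rule (best-scoring candidate as "chosen", worst-scoring as "rejected"), the function names, and the value of `beta` are all illustrative assumptions, not the authors' implementation. Only the loss formula itself is the standard DPO loss.

```python
import math


def pick_preference_pair(candidates, qe_score):
    """Return (chosen, rejected) from candidate translations.

    ASSUMPTION: the best- and worst-scoring candidates form the pair.
    `qe_score` is a hypothetical callable mapping a candidate to a
    quality estimate (higher is better).
    """
    ranked = sorted(candidates, key=qe_score)
    return ranked[-1], ranked[0]


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    (logp_*) and under the frozen reference model (ref_logp_*).
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy usage with made-up QE scores for three candidate translations.
toy_scores = {"candidate A": 0.2, "candidate B": 0.9, "candidate C": 0.5}
chosen, rejected = pick_preference_pair(list(toy_scores), toy_scores.get)
```

In this sketch the loss falls below log 2 once the policy raises the probability of the QE-preferred translation relative to the reference model by more than it raises the rejected one.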
Sep-26-2024
- Country:
- Oceania > Australia
- North America
- United States
- Oregon > Multnomah County
- Portland (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.04)
- Massachusetts > Suffolk County
- Boston (0.04)
- Mexico > Mexico City
- Mexico City (0.04)
- Europe
- Switzerland (0.04)
- Slovenia (0.04)
- Czechia > Prague (0.04)
- Bulgaria
- Varna Province > Varna (0.04)
- Sofia City Province > Sofia (0.04)
- Italy > Tuscany
- Florence (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Sweden > Vaestra Goetaland
- Gothenburg (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Poland > Masovia Province
- Warsaw (0.04)
- Asia
- Singapore (0.04)
- Thailand
- Middle East
- Israel (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Republic of Türkiye > Istanbul Province
- Istanbul (0.04)
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Government (0.67)
- Technology: