Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation
Kartik, Kartik, Soni, Sanjana, Kunchukuttan, Anoop, Chakraborty, Tanmoy, Akhtar, Md Shad
arXiv.org Artificial Intelligence
Widespread online communication in a modern multilingual world has provided opportunities to blend more than one language (also known as code-mixed language) in a single utterance. This has resulted in a formidable challenge for computational models due to the scarcity of annotated data and the presence of noise. A potential solution to mitigate the data scarcity problem in a low-resource setup is to leverage existing data in resource-rich languages through translation. In this paper, we tackle the problem of code-mixed (Hinglish and Bengalish) to English machine translation. First, we synthetically develop HINMIX, a parallel corpus of Hinglish to English, with ~4.2M sentence pairs. Subsequently, we propose RCMT, a robust perturbation-based joint-training model that learns to handle noise in real-world code-mixed text by parameter sharing across clean and noisy words. Further, we show the adaptability of RCMT in a zero-shot setup for Bengalish to English translation. Our evaluation and comprehensive analyses qualitatively and quantitatively demonstrate the superiority of RCMT over state-of-the-art code-mixed and robust translation methods.
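The abstract does not spell out how the perturbations are generated, so the following is only a minimal sketch of the general idea of pairing each clean source sentence with a noisy variant for joint training. The noise operations (vowel dropping, adjacent-character swaps, character repetition), the function names, and the example sentence are all illustrative assumptions, not the RCMT procedure or HINMIX data.

```python
import random

VOWELS = set("aeiou")

def perturb_word(word, rng):
    """Apply one random character-level edit (hypothetical noise model)."""
    if len(word) < 3:
        return word  # leave very short words untouched
    op = rng.choice(["drop_vowel", "swap", "repeat"])
    if op == "drop_vowel":
        # Drop one non-initial vowel, mimicking informal romanized spelling.
        idx = [i for i, ch in enumerate(word) if i > 0 and ch.lower() in VOWELS]
        if idx:
            i = rng.choice(idx)
            return word[:i] + word[i + 1:]
        return word
    if op == "swap":
        # Swap two adjacent characters, mimicking a typing slip.
        i = rng.randrange(len(word) - 1)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    # "repeat": duplicate one character, mimicking elongated spelling.
    i = rng.randrange(len(word))
    return word[:i] + word[i] + word[i:]

def perturb_sentence(sentence, rate, rng):
    """Perturb each whitespace-separated token with probability `rate`."""
    return " ".join(
        perturb_word(w, rng) if rng.random() < rate else w
        for w in sentence.split()
    )

def make_joint_examples(pairs, rate=0.3, seed=0):
    """Return clean and noisy source variants, each sharing the same target."""
    rng = random.Random(seed)
    examples = []
    for src, tgt in pairs:
        examples.append((src, tgt))                               # clean view
        examples.append((perturb_sentence(src, rate, rng), tgt))  # noisy view
    return examples

if __name__ == "__main__":
    # Illustrative sentence pair only; not taken from any corpus.
    demo = [("mujhe yeh movie bahut pasand hai", "I like this movie a lot")]
    for src, tgt in make_joint_examples(demo, rate=0.5, seed=1):
        print(src, "->", tgt)
```

Training a single model on both views of each pair is one simple way to expose shared parameters to clean and noisy spellings of the same word; the paper's actual joint-training objective may differ.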
Apr-29-2024
- Genre:
- Research Report (0.70)
- Technology: