Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair
Sakai, Yusuke, Makinae, Mana, Kamigaito, Hidetaka, Watanabe, Taro
arXiv.org Artificial Intelligence
In Simultaneous Machine Translation (SiMT), training on a simultaneous interpretation (SI) corpus is an effective way to build systems that are both high-quality and low-latency. However, curating such a corpus is very challenging because of limitations in annotators' abilities, so existing SI corpora are limited. We therefore propose a method that uses Large Language Models to convert existing speech translation corpora into interpretation-style data that maintains the original word order and preserves the entire source content; we call the resulting dataset the LLM-SI-Corpus. We demonstrate that fine-tuning SiMT models on the LLM-SI-Corpus, in both text-to-text and speech-to-text settings, reduces latency while maintaining the same level of quality as models trained on offline datasets. The LLM-SI-Corpus is available at https://github.com/yusuke1997/LLM-SI-Corpus.
Apr-18-2024
- Country:
  - Oceania > Australia
  - North America
    - Dominican Republic (0.04)
    - United States
      - New York (0.04)
      - California (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
    - Canada > Ontario
      - Toronto (0.04)
  - Europe
    - Monaco (0.04)
    - Bulgaria > Varna Province
      - Varna (0.04)
    - Iceland > Capital Region
      - Reykjavik (0.04)
    - Spain > Valencian Community
      - Valencia Province > Valencia (0.04)
    - Italy > Tuscany
      - Florence (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Portugal > Lisbon
      - Lisbon (0.04)
    - Middle East > Republic of Türkiye
      - Istanbul Province > Istanbul (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
  - Asia
    - Singapore (0.04)
    - China (0.04)
    - Vietnam > Da Nang
      - Da Nang (0.04)
    - Thailand > Bangkok
      - Bangkok (0.04)
    - Middle East
      - Jordan (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.04)
      - Republic of Türkiye > Istanbul Province
        - Istanbul (0.04)
    - Japan
      - Kyūshū & Okinawa > Kyūshū
        - Miyazaki Prefecture > Miyazaki (0.04)
      - Honshū
        - Kansai > Kyoto Prefecture
          - Kyoto (0.04)
        - Chūbu > Toyama Prefecture
          - Toyama (0.04)
- Genre:
  - Research Report (0.82)
- Technology: