Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems
Hu, Songbo; Zhou, Han; Hergul, Mete; Gritta, Milan; Zhang, Guchun; Iacobacci, Ignacio; Vulić, Ivan; Korhonen, Anna
arXiv.org Artificial Intelligence
Creating high-quality annotated data for task-oriented dialog (ToD) is notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets for multiple languages. As a result, current datasets remain scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or a lack of cultural adaptation. In this work, we first take stock of the current landscape of multilingual ToD datasets, offering a systematic overview of their properties and limitations. To address the limitations we identify, we then introduce Multi3WOZ, a novel multilingual, multi-domain, multi-parallel ToD dataset. It is large-scale and offers culturally adapted dialogs in 4 languages to enable training and evaluation of multilingual and cross-lingual ToD systems. We describe the complex bottom-up data collection process that yielded the final dataset and offer the first sets of baseline scores across different ToD-related tasks for future reference, which also highlight the dataset's challenging nature.
Jul-26-2023
- Genre:
- Overview (0.68)
- Research Report (0.64)
- Industry:
- Consumer Products & Services (0.46)
- Information Technology > Security & Privacy (0.46)
- Technology: