Moradshahi, Mehrad
Benchmarks Underestimate the Readiness of Multi-lingual Dialogue Agents
Lee, Andrew H., Semnani, Sina J., Castillo-López, Galo, de Chalendar, Gaël, Choudhury, Monojit, Dua, Ashna, Kavitha, Kapil Rajesh, Kim, Sungkyun, Kodali, Prashant, Kumaraguru, Ponnurangam, Lombard, Alexis, Moradshahi, Mehrad, Park, Gihyun, Semmar, Nasredine, Seo, Jiwon, Shen, Tianhao, Shrivastava, Manish, Xiong, Deyi, Lam, Monica S.
Creating multilingual task-oriented dialogue (TOD) agents is challenging due to the high cost of training data acquisition. Following the research trend of improving training data efficiency, we show, for the first time, that in-context learning is sufficient to tackle multilingual TOD. To handle the challenging dialogue state tracking (DST) subtask, we break it down into simpler steps that are more compatible with in-context learning, where only a handful of few-shot examples are used. We test our approach on the multilingual TOD dataset X-RiSAWOZ, which has 12 domains in Chinese, English, French, Korean, Hindi, and code-mixed Hindi-English. Our turn-by-turn DST accuracy on the 6 languages ranges from 55.6% to 80.3%, seemingly worse than the SOTA results from fine-tuned models, which range from 60.7% to 82.8%; our BLEU scores in the response generation (RG) subtask are also significantly lower than SOTA. However, after manual evaluation of the validation set, we find that by correcting gold label errors and improving the dataset annotation schema, GPT-4 with our prompts can achieve (1) 89.6%-96.8% accuracy in DST, and (2) more than 99% correct response generation across different languages. This leads us to conclude that current automatic metrics heavily underestimate the effectiveness of in-context learning.
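The step-by-step decomposition lends itself to a simple implementation. Below is a minimal sketch of the idea, assuming a generic llm() completion helper and a two-step breakdown (changed-slot detection, then per-slot value extraction); the prompt formats and helper names are illustrative assumptions, not the paper's actual prompts.

```python
# Minimal sketch of decomposed dialogue state tracking (DST) with in-context
# learning. The two-step breakdown and the llm() helper are illustrative
# assumptions, not the paper's exact prompts.

FEW_SHOT_SLOT_EXAMPLES = """\
Agent: Which city are you looking for a hotel in?
User: Somewhere central in Paris, 4 stars or better.
Changed slots: hotel-city, hotel-stars
"""

FEW_SHOT_VALUE_EXAMPLES = """\
Utterance: Somewhere central in Paris, 4 stars or better.
Slot: hotel-city
Value: Paris
"""

def llm(prompt: str) -> str:
    """Placeholder for any text-completion model (e.g., GPT-4 via an API)."""
    raise NotImplementedError

def track_state(state: dict, agent_turn: str, user_turn: str) -> dict:
    # Step 1: ask the model only which slots changed in this turn.
    slots_out = llm(
        f"{FEW_SHOT_SLOT_EXAMPLES}\nAgent: {agent_turn}\nUser: {user_turn}\n"
        "Changed slots:"
    )
    changed = [s.strip() for s in slots_out.split(",") if s.strip()]
    # Step 2: extract a value for each changed slot with a separate prompt.
    for slot in changed:
        value = llm(
            f"{FEW_SHOT_VALUE_EXAMPLES}\nUtterance: {user_turn}\n"
            f"Slot: {slot}\nValue:"
        ).strip()
        state[slot] = value
    return state
```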
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents
Moradshahi, Mehrad, Shen, Tianhao, Bali, Kalika, Choudhury, Monojit, de Chalendar, Gaël, Goel, Anmol, Kim, Sungkyun, Kodali, Prashant, Kumaraguru, Ponnurangam, Semmar, Nasredine, Semnani, Sina J., Seo, Jiwon, Seshadri, Vivek, Shrivastava, Manish, Sun, Michael, Yadavalli, Aditya, You, Chaobin, Xiong, Deyi, Lam, Monica S.
Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ dataset into 4 languages: English, French, Hindi, and Korean; and one code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most prior multilingual work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural and dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings, where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset, code, and toolkit are released open-source.
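The hybrid entity alignment could look roughly like the sketch below: try a bilingual entity dictionary first, then fall back to fuzzy span matching as a cheap stand-in for the neural alignment component. All helper names, the span-length cap, and the similarity threshold are illustrative assumptions.

```python
# Illustrative sketch of hybrid entity alignment: dictionary lookup first,
# then fuzzy matching against the machine translation as a stand-in for the
# paper's neural alignment step.
from difflib import SequenceMatcher

def align_entity(entity_src: str, translated_utterance: str,
                 entity_dict: dict) -> str | None:
    # Dictionary-based alignment: exact lookup of a known translation.
    known = entity_dict.get(entity_src)
    if known and known in translated_utterance:
        return known
    # Fallback: pick the utterance span most similar to the known translation
    # (or to the source entity itself), approximating a neural aligner.
    target = known or entity_src
    tokens = translated_utterance.split()
    best, best_score = None, 0.0
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 6, len(tokens)) + 1):
            span = " ".join(tokens[i:j])
            score = SequenceMatcher(None, target.lower(), span.lower()).ratio()
            if score > best_score:
                best, best_score = span, score
    return best if best_score > 0.6 else None
```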
Contextual Semantic Parsing for Multilingual Task-Oriented Dialogues
Moradshahi, Mehrad, Tsai, Victoria, Campagna, Giovanni, Lam, Monica S.
Robust state tracking for task-oriented dialogue systems currently remains restricted to a few popular languages. This paper shows that given a large-scale dialogue dataset in one language, we can automatically produce an effective semantic parser for other languages using machine translation. We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values and eliminate the costly human supervision used in previous benchmarks. We also propose a new contextual semantic parsing model, which encodes the formal slots and values, and only the last agent and user utterances. We show that this succinct representation reduces the compounding effect of translation errors, without harming accuracy in practice. We evaluate our approach on several dialogue state tracking benchmarks. On the RiSAWOZ, CrossWOZ, CrossWOZ-EN, and MultiWOZ-ZH datasets, we improve the state of the art by 11%, 17%, 20%, and 0.3% in joint goal accuracy. We present a comprehensive error analysis for these datasets, showing that erroneous annotations can lead to misguided judgments about the quality of the model. Finally, we present the RiSAWOZ English and German datasets, created using our translation methodology. On these datasets, accuracy is within 11% of the original, showing that high-accuracy multilingual dialogue datasets are possible without relying on expensive human annotations. We release our datasets and software open source.
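The succinct input representation described above can be illustrated with a small sketch: serialize the formal dialogue state as slot-value pairs and append only the last agent and user utterances. The serialization format and special tokens here are assumptions for illustration, not the paper's exact encoding.

```python
# Sketch of the succinct contextual-parsing input: the formal dialogue state
# plus only the last agent and user utterances. Format is illustrative.

def build_parser_input(state: dict, last_agent_utt: str,
                       last_user_utt: str) -> str:
    # Encoding the formal state instead of replaying the full natural-language
    # history limits the compounding of translation errors across turns.
    state_str = " ; ".join(f"{slot} = {value}"
                           for slot, value in sorted(state.items()))
    return f"DST: <state> {state_str} <agent> {last_agent_utt} <user> {last_user_utt}"

# Example:
# build_parser_input({"hotel-city": "Paris"}, "Which area?", "Near the Louvre.")
# -> "DST: <state> hotel-city = Paris <agent> Which area? <user> Near the Louvre."
```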
Zero and Few-Shot Localization of Task-Oriented Dialogue Agents with a Distilled Representation
Moradshahi, Mehrad, Semnani, Sina J., Lam, Monica S.
Task-oriented Dialogue (ToD) agents are mostly limited to a few widely-spoken languages, mainly due to the high cost of acquiring training data for each language. Existing low-cost approaches that rely on cross-lingual embeddings or naive machine translation sacrifice substantial accuracy for data efficiency, and largely fail to create a usable dialogue agent. We propose automatic methods that use ToD training data in a source language to build a high-quality functioning dialogue agent in another target language that has no training data (i.e., zero-shot) or only a small training set (i.e., few-shot). Unlike most prior work in cross-lingual ToD that focuses only on Dialogue State Tracking (DST), we build an end-to-end agent. We show that our approach closes the accuracy gap between few-shot and existing full-shot methods for ToD agents. We achieve this by (1) improving the dialogue data representation, (2) improving entity-aware machine translation, and (3) automatically filtering noisy translations. We evaluate our approach on the recent bilingual dialogue dataset BiToD. In Chinese-to-English transfer, in the zero-shot setting, our method achieves 46.7% and 22.0% in Task Success Rate (TSR) and Dialogue Success Rate (DSR) respectively. In the few-shot setting, where 10% of the data in the target language is used, we improve the state of the art by 15.2% and 14.0%, coming within 5% of full-shot training.
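One common way to realize step (3), automatic filtering of noisy translations, is to keep a translated example only if its annotated slot values can still be located in the translated utterance. The sketch below implements that check; the data format and field names are assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch of automatic filtering of noisy translations: keep a
# translated training example only if every annotated slot value survives
# in the translated utterance.

def filter_translations(examples: list) -> list:
    kept = []
    for ex in examples:
        utterance = ex["translated_utterance"].lower()
        # ex["slot_values"] maps slot names to their translated values.
        if all(v.lower() in utterance for v in ex["slot_values"].values()):
            kept.append(ex)
    return kept
```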
Investigating Effect of Dialogue History in Multilingual Task Oriented Dialogue Systems
Sun, Michael, Huang, Kaili, Moradshahi, Mehrad
While English virtual assistants have achieved impressive performance thanks to enormous training resources, the needs of non-English speakers remain poorly served. As of December 2021, Alexa, one of the most popular smart speakers in the world, supported 9 different languages [1], while there are thousands of languages in the world, 91 of which are spoken by more than 10 million people according to statistics published in 2019 [2]. Training a virtual assistant in languages other than English is often more difficult, especially for low-resource languages: the lack of high-quality training data restricts model performance and results in poor user satisfaction. We therefore devise an efficient and effective training solution for multilingual task-oriented dialogue systems, using the same dataset generation pipeline and end-to-end dialogue system architecture as BiToD [5], which adopts key design choices for a minimalistic natural language design where formal dialogue states are used in place of natural language inputs. This reduces the room for error introduced by weaker natural language models, and ensures the model can correctly extract the essential slot values needed to perform dialogue state tracking (DST). Our goal is to reduce the amount of natural language encoded at each turn, and the key parameter we investigate is the number of turns (H) to feed to the model as history. We first locate the point where increasing H begins to yield diminishing returns on overall performance. We then examine whether the examples a model with small H gets wrong can be categorized in a way that supports few-shot finetuning. Lastly, we explore the limitations of this approach, and whether there are certain types of examples that it cannot resolve.
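The key knob studied here, the history length H, amounts to truncating the natural-language context fed to the model. A minimal sketch, assuming a BiToD-style input that prepends the formal dialogue state (the format and special tokens are illustrative):

```python
# Illustrative sketch of truncating dialogue history to the last H turns.
# The input format is an assumption; BiToD-style systems prepend the formal
# dialogue state rather than replaying the full natural-language history.

def build_model_input(state_str: str, history: list, h: int) -> str:
    # history holds alternating agent/user utterances, oldest first;
    # keep only the most recent h turns.
    recent = history[-h:] if h > 0 else []
    return f"<state> {state_str} <history> " + " | ".join(recent)

# Sweeping h over the dev set locates where extra history stops helping:
# for h in range(1, 9):
#     inputs = [build_model_input(s, hist, h) for s, hist in dev_examples]
#     ...evaluate DST accuracy...
```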
HUBERT Untangles BERT to Improve Transfer across NLP Tasks
Moradshahi, Mehrad, Palangi, Hamid, Lam, Monica S., Smolensky, Paul, Gao, Jianfeng
We show that there is shared structure between different NLP datasets that HUBERT, but not BERT, is able to learn and leverage. Our experimental results show that untangling data-specific semantics from general language structure is key to better transfer among NLP tasks. Built on the Transformer architecture (Vaswani et al., 2017), the BERT model (Devlin et al., 2018) has demonstrated great power for providing general-purpose vector embeddings of natural language: its representations have served as the basis of many successful deep Natural Language Processing (NLP) models on a variety of tasks (e.g., Liu et al., 2019a,b; Zhang et al., 2019). Recent studies (Coenen et al., 2019; Hewitt & Manning, 2019; Lin et al., 2019; Tenney et al., 2019) have shown that BERT representations carry considerable information about grammatical structure, which, by design, is a deep and general encapsulation of linguistic information. Symbolic computation over structured symbolic representations such as parse trees has long been used to formalize linguistic knowledge. To strengthen the generality of BERT's representations, we propose to import this type of computation into its architecture. Symbolic linguistic representations support the important distinction between content and form information. The form consists of a structure devoid of content, such as an unlabeled tree, a collection of nodes defined by their structural positions or roles (Newell, 1980), such as root, left-child-of-root, right-child-of-left-child-of-root, etc. In a particular linguistic expression such as "Kim referred to herself during the speech", these purely-structural roles are filled with particular content-bearing symbols, including terminal words like Kim and non-terminal categories like NounPhrase. These role fillers have their own identities, which are preserved as they move from role to role across expressions: Kim retains its referent and its semantic properties whether it fills the subject or the object role in a sentence.
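The role-filler separation described above is the core idea of tensor-product representations: each filler (content) vector is bound to a role (structural position) vector via an outer product, and the bindings are summed. Below is a minimal numpy sketch of that mechanism; it illustrates the symbolic computation being imported, not HUBERT's exact architecture.

```python
# Minimal numpy sketch of tensor-product role-filler binding (illustrative;
# not HUBERT's exact architecture).
import numpy as np

def bind(fillers: np.ndarray, roles: np.ndarray) -> np.ndarray:
    """fillers: (n, d_f) content vectors; roles: (n, d_r) structural-position
    vectors. Returns the summed outer products, shape (d_f, d_r)."""
    return np.einsum("nf,nr->fr", fillers, roles)

def unbind(tpr: np.ndarray, role: np.ndarray) -> np.ndarray:
    """Recover the filler bound to `role` (exact when roles are orthonormal)."""
    return tpr @ role

rng = np.random.default_rng(0)
roles = np.linalg.qr(rng.normal(size=(8, 8)))[0][:3]   # 3 orthonormal roles
fillers = rng.normal(size=(3, 16))                     # e.g., word embeddings
tpr = bind(fillers, roles)
# The filler keeps its identity regardless of which role it occupies:
assert np.allclose(unbind(tpr, roles[0]), fillers[0])
```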
Language Modeling with Generative Adversarial Networks
Moradshahi, Mehrad, Contractor, Utkarsh
Generative Adversarial Networks (GANs) have shown promise in image generation; however, they have proved hard to train for language generation. GANs were originally designed to output differentiable values, so generating discrete language is challenging for them, which causes high instability during training. Consequently, past work has either pre-trained with maximum likelihood or trained GANs without pre-training using a WGAN objective with a gradient penalty. In this study, we present a comparison of those approaches. Furthermore, we present experimental results indicating better training and convergence of Wasserstein GANs (WGANs) when a weaker regularization term enforces the Lipschitz constraint.
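For reference, the gradient penalty in question pushes the critic's gradient norm toward 1 on interpolated samples. Below is a minimal PyTorch sketch; the one-sided variant shown (penalizing only norms above 1) is one possible weaker Lipschitz regularizer, named here as an assumption rather than the paper's exact choice.

```python
# Minimal PyTorch sketch of the WGAN gradient penalty. The one-sided option
# is an illustrative weaker Lipschitz regularizer, not necessarily the
# paper's exact variant.
import torch

def gradient_penalty(critic, real, fake, weight=10.0, one_sided=False):
    # real, fake: (batch, features) tensors from the data and the generator.
    alpha = torch.rand(real.size(0), 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # penalty must stay differentiable for the critic
    )[0]
    norms = grads.view(grads.size(0), -1).norm(2, dim=1)
    # Standard WGAN-GP pulls norms toward 1; the one-sided variant only
    # penalizes norms above 1, a weaker constraint.
    excess = (norms - 1.0).clamp(min=0) if one_sided else (norms - 1.0)
    return weight * (excess ** 2).mean()
```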