Hamza, Wael
CALICO: Conversational Agent Localization via Synthetic Data Generation
Rosenbaum, Andy, Kharazmi, Pegah, Banijamali, Ershad, Zeng, Lu, DiPersio, Christopher, Wei, Pan, Oz, Gokmen, Chung, Clement, Owczarzak, Karolina, Triefenbach, Fabian, Hamza, Wael
We present CALICO, a method to fine-tune Large Language Models (LLMs) to localize conversational agent training data from one language to another. For slots (named entities), CALICO supports three operations: verbatim copy, literal translation, and localization, i.e. generating slot values more appropriate in the target language, such as city and airport names located in countries where the language is spoken. Furthermore, we design an iterative filtering mechanism to discard noisy generated samples, which we show boosts the performance of the downstream conversational agent. To prove the effectiveness of CALICO, we build and release a new human-localized (HL) version of the MultiATIS++ travel information test set in 8 languages. Compared to the original human-translated (HT) version of the test set, we show that our new HL version is more challenging. We also show that CALICO out-performs state-of-the-art LINGUIST (which relies on literal slot translation out of context) both on the HT case, where CALICO generates more accurate slot translations, and on the HL case, where CALICO generates localized slots which are closer to the HL test set.
Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks
Everson, Kevin, Gu, Yile, Yang, Huck, Shivakumar, Prashanth Gurunath, Lin, Guan-Ting, Kolehmainen, Jari, Bulyko, Ivan, Gandhe, Ankur, Ghosh, Shalini, Hamza, Wael, Lee, Hung-yi, Rastrow, Ariya, Stolcke, Andreas
In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In real-world scenarios, prior to input into an LLM, an automated speech recognition (ASR) system generates an output transcript hypothesis, where inherent errors can degrade subsequent SLU tasks. Here we introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts with the help of word confusion networks from lattices, bridging the SLU performance gap between using the top ASR hypothesis and an oracle upper bound. Additionally, we delve into the LLM's robustness to varying ASR performance conditions and scrutinize the aspects of in-context learning which prove the most influential.
Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models
Soltan, Saleh, Rosenbaum, Andy, Falke, Tobias, Lu, Qin, Rumshisky, Anna, Hamza, Wael
Pre-trained encoder-only and sequence-to-sequence (seq2seq) models each have advantages, however training both model types from scratch is computationally expensive. We explore recipes to improve pre-training efficiency by initializing one model from the other. (1) Extracting the encoder from a seq2seq model, we show it under-performs a Masked Language Modeling (MLM) encoder, particularly on sequence labeling tasks. Variations of masking during seq2seq training, reducing the decoder size, and continuing with a small amount of MLM training do not close the gap. (2) Conversely, using an encoder to warm-start seq2seq training, we show that by unfreezing the encoder partway through training, we can match task performance of a from-scratch seq2seq model. Overall, this two-stage approach is an efficient recipe to obtain both a multilingual encoder and a seq2seq model, matching the performance of training each model from scratch while reducing the total compute cost by 27%.
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Lialin, Vladislav, Rawls, Stephen, Chan, David, Ghosh, Shalini, Rumshisky, Anna, Hamza, Wael
Scaling up weakly-supervised datasets has shown to be highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity of aligned data, the lack of diversity in the data, and the difficulty of collecting aligned data. Currently popular video-text data mining approach via automatic speech recognition (ASR) used in HowTo100M provides low-quality captions that often do not refer to the video content. Other mining approaches do not provide proper language descriptions (video tags) and are biased toward short clips (alt text). In this work, we show how recent advances in image captioning allow us to pre-train high-quality video models without any parallel video-text data. We pre-train several video captioning models that are based on an OPT language model and a TimeSformer visual backbone. We fine-tune these networks on several video captioning datasets. First, we demonstrate that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions. Second, we show that pre-training on both images and videos produces a significantly better network (+4 CIDER on MSR-VTT) than pre-training on a single modality. Our methods are complementary to the existing pre-training or data mining approaches and can be used in a variety of settings. Given the efficacy of the pseudolabeling method, we are planning to publicly release the generated captions.
Low-Resource Compositional Semantic Parsing with Concept Pretraining
Rongali, Subendhu, Sridhar, Mukund, Khan, Haidar, Arkoudas, Konstantine, Hamza, Wael, McCallum, Andrew
Semantic parsing plays a key role in digital voice assistants such as Alexa, Siri, and Google Assistant by mapping natural language to structured meaning representations. When we want to improve the capabilities of a voice assistant by adding a new domain, the underlying semantic parsing model needs to be retrained using thousands of annotated examples from the new domain, which is time-consuming and expensive. In this work, we present an architecture to perform such domain adaptation automatically, with only a small amount of metadata about the new domain and without any new training data (zero-shot) or with very few examples (few-shot). We use a base seq2seq (sequence-to-sequence) architecture and augment it with a concept encoder that encodes intent and slot tags from the new domain. We also introduce a novel decoder-focused approach to pretrain seq2seq models to be concept aware using Wikidata and use it to help our model learn important concepts and perform well in low-resource settings. We report few-shot and zero-shot results for compositional semantic parsing on the TOPv2 dataset and show that our model outperforms prior approaches in few-shot settings for the TOPv2 and SNIPS datasets.
Delexicalized Paraphrase Generation
Yu, Boya, Arkoudas, Konstantine, Hamza, Wael
We present a neural model for paraphrasing and train it to generate delexicalized sentences. We achieve this by creating training data in which each input is paired with a number of reference paraphrases. These sets of reference paraphrases represent a weak type of semantic equivalence based on annotated slots and intents. To understand semantics from different types of slots, other than anonymizing slots, we apply convolutional neural networks (CNN) prior to pooling on slot values and use pointers to locate slots in the output. We show empirically that the generated paraphrases are of high quality, leading to an additional 1.29% exact match on live utterances. We also show that natural language understanding (NLU) tasks, such as intent classification and named entity recognition, can benefit from data augmentation using automatically generated paraphrases.
Style Attuned Pre-training and Parameter Efficient Fine-tuning for Spoken Language Understanding
Cao, Jin, Wang, Jun, Hamza, Wael, Vanee, Kelly, Li, Shang-Wen
Neural models have yielded state-of-the-art results in deciphering spoken language understanding (SLU) problems; however, these models require a significant amount of domain-specific labeled examples for training, which is prohibitively expensive. While pre-trained language models like BERT have been shown to capture a massive amount of knowledge by learning from unlabeled corpora and solve SLU using fewer labeled examples for adaption, the encoding of knowledge is implicit and agnostic to downstream tasks. Such encoding results in model inefficiencies in parameter usage: an entirely new model is required for every domain. To address these challenges, we introduce a novel SLU framework, comprising a conversational language modeling (CLM) pre-training task and a light encoder architecture. The CLM pre-training enables networks to capture the representation of the language in conversation style with the presence of ASR errors. The light encoder architecture separates the shared pre-trained networks from the mappings of generally encoded knowledge to specific domains of SLU, allowing for the domain adaptation to be performed solely at the light encoder and thus increasing efficiency. With the framework, we match the performance of state-of-the-art SLU results on Alexa internal datasets and on two public ones (ATIS, SNIPS), adding only 4.4% parameters per task.
Don't Parse, Insert: Multilingual Semantic Parsing with Insertion Based Decoding
Zhu, Qile, Khan, Haidar, Soltan, Saleh, Rawls, Stephen, Hamza, Wael
Semantic parsing is one of the key components of natural language understanding systems. A successful parse transforms an input utterance to an action that is easily understood by the system. Many algorithms have been proposed to solve this problem, from conventional rulebased or statistical slot-filling systems to shiftreduce based neural parsers. For complex parsing tasks, the state-of-the-art method is based on autoregressive sequence to sequence models to generate the parse directly. This model is slow at inference time, generating parses in O(n) decoding steps (n is the length of the target sequence). In addition, we demonstrate that this method performs poorly in zero-shot cross-lingual transfer learning settings. In this paper, we propose a non-autoregressive parser which is based on the insertion transformer to overcome these two issues. Our approach 1) speeds up decoding by 3x while outperforming the autoregressive model and 2) significantly improves cross-lingual transfer in the low-resource setting by 37% compared to autoregressive baseline. We test our approach on three well-known monolingual datasets: ATIS, SNIPS and TOP. For cross lingual semantic parsing, we use the MultiATIS++ and the multilingual TOP datasets.
Neural Cross-Lingual Entity Linking
Sil, Avirup (IBM Research AI) | Kundu, Gourab (IBM) | Florian, Radu (IBM) | Hamza, Wael (IBM)
A major challenge in Entity Linking (EL) is making effective use of contextual information to disambiguate mentions to Wikipedia that might refer to different entities in different contexts. The problem exacerbates with cross-lingual EL which involves linking mentions written in non-English documents to entries in the English Wikipedia: to compare textual clues across languages we need to compute similarity between textual fragments across languages. In this paper, we propose a neural EL model that trains fine-grained similarities and dissimilarities between the query and candidate document from multiple perspectives, combined with convolution and tensor networks. Further, we show that this English-trained system can be applied, in zero-shot learning, to other languages by making surprisingly effective use of multi-lingual embeddings. The proposed system has strong empirical evidence yielding state-of-the-art results in English as well as cross-lingual: Spanish and Chinese TAC 2015 datasets.
Bilateral Multi-Perspective Matching for Natural Language Sentences
Wang, Zhiguo, Hamza, Wael, Florian, Radu
Natural language sentence matching is a fundamental technology for a variety of tasks. Previous approaches either match sentences from a single direction or only apply single granular (word-by-word or sentence-by-sentence) matching. In this work, we propose a bilateral multi-perspective matching (BiMPM) model under the "matching-aggregation" framework. Given two sentences $P$ and $Q$, our model first encodes them with a BiLSTM encoder. Next, we match the two encoded sentences in two directions $P \rightarrow Q$ and $P \leftarrow Q$. In each matching direction, each time step of one sentence is matched against all time-steps of the other sentence from multiple perspectives. Then, another BiLSTM layer is utilized to aggregate the matching results into a fix-length matching vector. Finally, based on the matching vector, the decision is made through a fully connected layer. We evaluate our model on three tasks: paraphrase identification, natural language inference and answer sentence selection. Experimental results on standard benchmark datasets show that our model achieves the state-of-the-art performance on all tasks.