Machine Translation
Learning and Analyzing Generation Order for Undirected Sequence Models
Undirected neural sequence models have achieved performance competitive with the state-of-the-art directed sequence models that generate monotonically from left to right in machine translation tasks. In this work, we train a policy that learns the generation order for a pre-trained, undirected translation model via reinforcement learning. We show that the translations decoded by our learned orders achieve higher BLEU scores than the outputs decoded from left to right or decoded by the learned order from Mansimov et al. (2019) on the WMT'14 German-English translation task. On examples with a maximum source and target length of 30 from De-En, WMT'16 English-Romanian, and WMT'21 English-Chinese translation tasks, our learned order outperforms all heuristic generation orders on four out of six tasks. We next carefully analyze the learned order patterns via qualitative and quantitative analysis. We show that our policy generally follows an outer-to-inner order, predicting the left-most and right-most positions first, and then moving toward the middle while skipping less important words at the beginning. Furthermore, the policy usually predicts positions for a single syntactic constituent structure in consecutive steps. We believe our findings could provide more insights on the mechanism of undirected generation models and encourage further research in this direction. Our code is publicly available at https://github.com/jiangycTarheel/undirected-generation
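The outer-to-inner pattern described in the abstract above can be illustrated with a minimal sketch. It is an idealized heuristic, not the learned policy from the paper: the function names `outer_to_inner_order` and `decode_in_order` and the `_` mask symbol are assumptions made for illustration, and the sketch ignores the skipping of less important words that the authors report.

```python
from typing import List


def outer_to_inner_order(length: int) -> List[int]:
    """Return positions 0..length-1, alternating between the left and right ends."""
    left, right = 0, length - 1
    order: List[int] = []
    while left <= right:
        order.append(left)
        if left != right:  # avoid emitting the middle position twice
            order.append(right)
        left += 1
        right -= 1
    return order


def decode_in_order(tokens: List[str], order: List[int], mask: str = "_") -> List[str]:
    """Fill a fully masked canvas one position at a time and record each step."""
    canvas = [mask] * len(tokens)
    steps: List[str] = []
    for pos in order:
        canvas[pos] = tokens[pos]  # a real model would predict this token
        steps.append(" ".join(canvas))
    return steps


if __name__ == "__main__":
    sentence = ["the", "cat", "sat", "down", "."]
    for step in decode_in_order(sentence, outer_to_inner_order(len(sentence))):
        print(step)
```

Here the target tokens are copied from a known sentence only to show the filling order; in an undirected translation model each step would instead predict the token at the chosen position.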
Adapting Document-Grounded Dialog Systems to Spoken Conversations using Data Augmentation and a Noisy Channel Model
Thulke, David, Daheim, Nico, Dugast, Christian, Ney, Hermann
This paper summarizes our submission to Task 2 of the second track of the 10th Dialog System Technology Challenge (DSTC10) "Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations". Similar to the previous year's iteration, the task consists of three subtasks: detecting whether a turn is knowledge seeking, selecting the relevant knowledge document and finally generating a grounded response. This year, the focus lies on adapting the system to noisy ASR transcripts. We explore different approaches to make the models more robust to this type of input and to adapt the generated responses to the style of spoken conversations. For the latter, we get the best results with a noisy channel model that additionally reduces the number of short and generic responses. Our best system achieved the 1st rank in the automatic and the 3rd rank in the human evaluation of the challenge.
Chinese TV introducing AI sign language presenter at the next Olympics
Chinese TV will introduce the first AI sign language presenter in time for the 2022 Winter Olympics in Beijing. China Central Television (CCTV) and Baidu AI Cloud said the launch of the AI sign language presenter represents a huge leap forwards in 'overcoming the barrier of sound with technology'. Nearly 28 million people in China are hearing impaired and about 430 million around the world also suffer from hearing loss. The launch of the AI presenter will allow the state broadcaster to include sign language services for viewers around the clock, and will start by giving updates of the Winter Olympics in Beijing early next year. The presenter achieves high-level sign language expression thanks to Baidu's natural action engine and their sign language translation engine.
Applying SoftTriple Loss for Supervised Language Model Fine Tuning
Sosnowski, Witold, Wroblewska, Anna, Gawrysiak, Piotr
Natural language processing (NLP) is a rapidly growing area of machine learning with applications wherever a computer needs to operate on a text that involves capturing its semantics. It may include text classification, translation, text summarization, question answering, and dialogue. All these tasks are downstream and depend on the quality of the text representation (White et al., 2015). Many models can produce such text representations, from Bag-of-Words or Word2Vec word embeddings to the state-of-the-art language representation model BERT with variations in most NLP tasks. The best performance on text classification tasks is obtained when the model is first trained on a general knowledge corpus to capture semantic relationships between words and then fine-tuned with an additional dense layer on a domain corpus with cross-entropy loss (Radford et al., 2019). We introduce a new loss function TripleEntropy to improve classification performance for fine-tuning general knowledge pre-trained language models based on cross-entropy loss and SoftTriple loss (Devlin et al., 2018; Qian et al., 2019). Triplet Loss transforms the embedding space so that vector representations from the same class can form separable subspaces, stabilizing and generalizing the language model fine-tuning process. TripleEntropy can improve the fine-tuning process of RoBERTa-based models so that performance on downstream tasks increases by about 0.02% to 2.29%.
SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
Wang, Zhecan, You, Haoxuan, Li, Liunian Harold, Zareian, Alireza, Park, Suji, Liang, Yiqing, Chang, Kai-Wei, Chang, Shih-Fu
Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure, at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structure knowledge extracted in the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly-supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with the state-of-the-art methods and prove the efficacy of each proposed component.
Textless Speech-to-Speech Translation on Real Data
Lee, Ann, Gong, Hongyu, Duquenne, Paul-Ambroise, Schwenk, Holger, Chen, Peng-Jen, Wang, Changhan, Popuri, Sravya, Pino, Juan, Gu, Jiatao, Hsu, Wei-Ning
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need for any text data. Different from existing work in the literature, we tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech target. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.
Learning Cross-Lingual IR from an English Retriever
Li, Yulong, Franz, Martin, Sultan, Md Arafat, Iyer, Bhavani, Lee, Young-Suk, Sil, Avirup
We present a new cross-lingual information retrieval (CLIR) model trained using multi-stage knowledge distillation (KD). The teacher and the student are heterogeneous systems: the former is a pipeline that relies on machine translation and monolingual IR, while the latter executes a single CLIR operation. We show that the student can learn both multilingual representations and CLIR by optimizing two corresponding KD objectives. Learning multilingual representations from an English-only retriever is accomplished using a novel cross-lingual alignment algorithm that greedily re-positions the teacher tokens for alignment. Evaluation on the XOR-TyDi benchmark shows that the proposed model is far more effective than the existing approach of fine-tuning with cross-lingual labeled IR data, with a gain in accuracy of 25.4 Recall@5kt.
Learning Non-Monotonic Automatic Post-Editing of Translations from Human Orderings
Góis, António, Cho, Kyunghyun, Martins, André
Recent research in neural machine translation has explored flexible generation orders, as an alternative to left-to-right generation. However, training non-monotonic models brings a new complication: how to search for a good ordering when there is a combinatorial explosion of orderings arriving at the same final result? Also, how do these automatic orderings compare with the actual behaviour of human translators? Current models rely on manually built biases or are left to explore all possibilities on their own. In this paper, we analyze the orderings produced by human post-editors and use them to train an automatic post-editing system. We compare the resulting system with those trained with left-to-right and random post-editing orderings. We observe that humans tend to follow a nearly left-to-right order, but with interesting deviations, such as preferring to start by correcting punctuation or verbs.
Azure AI empowers organizations to serve users in more than 100 languages
Microsoft announced today that 12 new languages and dialects have been added to Translator. These additions mean that the service can now translate between more than 100 languages and dialects, making information in text and documents accessible to 5.66 billion people worldwide. "One hundred languages is a good milestone for us to achieve our ambition for everyone to be able to communicate regardless of the language they speak," said Xuedong Huang, Microsoft technical fellow and Azure AI chief technology officer. Translator today covers the world's most spoken languages including English, Chinese, Hindi, Arabic and Spanish. In recent years, advances in AI technology have allowed the company to grow its language library with low-resource and endangered languages, such as Inuktitut, a dialect of Inuktut that is spoken by about 40,000 Inuit in Canada.
Maximum Bayes Smatch Ensemble Distillation for AMR Parsing
Lee, Young-Suk, Astudillo, Ramon Fernandez, Hoang, Thanh Lam, Naseem, Tahira, Florian, Radu, Roukos, Salim
AMR parsing has experienced an unprecedented increase in performance in the last three years, due to a mixture of effects including architecture improvements and transfer learning. Self-learning techniques have also played a role in pushing performance forward. However, for the most recent high-performing parsers, the effect of self-learning and silver data generation seems to be fading. In this paper we show that it is possible to overcome these diminishing returns of silver data by combining Smatch-based ensembling techniques with ensemble distillation. In an extensive experimental setup, we push single-model English parser performance above 85 Smatch for the first time and return to substantial gains. We also attain a new state-of-the-art for cross-lingual AMR parsing for Chinese, German, Italian and Spanish. Finally, we explore the impact of the proposed distillation technique on domain adaptation, and show that it can produce gains rivaling those of human-annotated data for QALD-9 and achieve a new state-of-the-art for BioAMR.