Machine Translation
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning
Zhang, Shaolei, Fang, Qingkai, Guo, Shoutao, Ma, Zhengrui, Zhang, Min, Feng, Yang
Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opportune moment within speech inputs, thereby posing a double challenge of translation and policy. In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified framework of multi-task learning. Adhering to a multi-task learning approach, StreamSpeech can perform offline and simultaneous speech recognition, speech translation and speech synthesis via an "All-in-One" seamless model. Experiments on CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art performance in both offline S2ST and Simul-S2ST tasks. Besides, StreamSpeech is able to present high-quality intermediate results (i.e., ASR or translation results) during simultaneous translation process, offering a more comprehensive real-time communication experience.
Readability-guided Idiom-aware Sentence Simplification (RISS) for Chinese
Zhang, Jingshen, Chen, Xinglu, Qiu, Xinying, Wang, Zhimin, Feng, Wenhe
Chinese sentence simplification faces challenges due to the lack of large-scale labeled parallel corpora and the prevalence of idioms. To address these challenges, we propose Readability-guided Idiom-aware Sentence Simplification (RISS), a novel framework that combines data augmentation techniques with lexcial simplification. RISS introduces two key components: (1) Readability-guided Paraphrase Selection (RPS), a method for mining high-quality sentence pairs, and (2) Idiom-aware Simplification (IAS), a model that enhances the comprehension and simplification of idiomatic expressions. By integrating RPS and IAS using multi-stage and multi-task learning strategies, RISS outperforms previous state-of-the-art methods on two Chinese sentence simplification datasets. Furthermore, RISS achieves additional improvements when fine-tuned on a small labeled dataset. Our approach demonstrates the potential for more effective and accessible Chinese text simplification.
Towards Real-world Scenario: Imbalanced New Intent Discovery
Zhang, Shun, Yan, Chaoran, Yang, Jian, Liu, Jiaheng, Mo, Ying, Bai, Jiaqi, Li, Tongliang, Li, Zhoujun
New Intent Discovery (NID) aims at detecting known and previously undefined categories of user intent by utilizing limited labeled and massive unlabeled data. Most prior works often operate under the unrealistic assumption that the distribution of both familiar and new intent classes is uniform, overlooking the skewed and long-tailed distributions frequently encountered in real-world scenarios. To bridge the gap, our work introduces the imbalanced new intent discovery (i-NID) task, which seeks to identify familiar and novel intent categories within long-tailed distributions. A new benchmark (ImbaNID-Bench) comprised of three datasets is created to simulate the real-world long-tail distributions. ImbaNID-Bench ranges from broad cross-domain to specific single-domain intent categories, providing a thorough representation of practical use cases. Besides, a robust baseline model ImbaNID is proposed to achieve cluster-friendly intent representations. It includes three stages: model pre-training, generation of reliable pseudo-labels, and robust representation learning that strengthens the model performance to handle the intricacies of real-world data distributions. Our extensive experiments on previous benchmarks and the newly established benchmark demonstrate the superior performance of ImbaNID in addressing the i-NID task, highlighting its potential as a powerful baseline for uncovering and categorizing user intents in imbalanced and long-tailed distributions\footnote{\url{https://github.com/Zkdc/i-NID}}.
What is the Best Way for ChatGPT to Translate Poetry?
Wang, Shanshan, Wong, Derek F., Yao, Jingming, Chao, Lidia S.
Machine translation (MT) has historically faced significant challenges when applied to literary works, particularly in the domain of poetry translation. The advent of Large Language Models such as ChatGPT holds potential for innovation in this field. This study examines ChatGPT's capabilities in English-Chinese poetry translation tasks, utilizing targeted prompts and small sample scenarios to ascertain optimal performance. Despite promising outcomes, our analysis reveals persistent issues in the translations generated by ChatGPT that warrant attention. To address these shortcomings, we propose an Explanation-Assisted Poetry Machine Translation (EAPMT) method, which leverages monolingual poetry explanation as a guiding information for the translation process. Furthermore, we refine existing evaluation criteria to better suit the nuances of modern poetry translation. We engaged a panel of professional poets for assessments, complemented evaluations by using GPT-4. The results from both human and machine evaluations demonstrate that our EAPMT method outperforms traditional translation methods of ChatGPT and the existing online systems. This paper validates the efficacy of our method and contributes a novel perspective to machine-assisted literary translation.
Wav2Gloss: Generating Interlinear Glossed Text from Speech
He, Taiqi, Choi, Kwanghee, Tjuatja, Lindia, Robinson, Nathaniel R., Shi, Jiatong, Watanabe, Shinji, Neubig, Graham, Mortensen, David R., Levin, Lori
Thousands of the world's languages are in danger of extinction--a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages' communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free translations to a majority language. We propose Wav2Gloss: a task in which these four annotation components are extracted automatically from speech, and introduce the first dataset to this end, Fieldwork: a corpus of speech with all these annotations, derived from the work of field linguists, covering 37 languages, with standard formatting, and train/dev/test splits. We provide various baselines to lay the groundwork for future research on IGT generation from speech, such as end-to-end versus cascaded, monolingual versus multilingual, and single-task versus multi-task approaches.
Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation
Hwang, Min-Jae, Kulikov, Ilia, Peloquin, Benjamin, Gong, Hongyu, Chen, Peng-Jen, Lee, Ann
In this paper, we propose a textless acoustic model with a self-supervised distillation strategy for noise-robust expressive speech-to-speech translation (S2ST). Recently proposed expressive S2ST systems have achieved impressive expressivity preservation performances by cascading unit-to-speech (U2S) generator to the speech-to-unit translation model. However, these systems are vulnerable to the presence of noise in input speech, which is an assumption in real-world translation scenarios. To address this limitation, we propose a U2S generator that incorporates a distillation with no label (DINO) self-supervised training strategy into it's pretraining process. Because the proposed method captures noise-agnostic expressivity representation, it can generate qualified speech even in noisy environment. Objective and subjective evaluation results verified that the proposed method significantly improved the performance of the expressive S2ST system in noisy environments while maintaining competitive performance in clean environments.
Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms
Trabelsi, Firas, Vilar, David, Finkelstein, Mara, Freitag, Markus
Minimum Bayes Risk (MBR) decoding is a powerful decoding strategy widely used for text generation tasks, but its quadratic computational complexity limits its practical application. This paper presents a novel approach for approximating MBR decoding using matrix completion techniques, focusing on the task of machine translation. We formulate MBR decoding as a matrix completion problem, where the utility metric scores between candidate hypotheses and pseudo-reference translations form a low-rank matrix. First, we empirically show that the scores matrices indeed have a low-rank structure. Then, we exploit this by only computing a random subset of the scores and efficiently recover the missing entries in the matrix by applying the Alternating Least Squares (ALS) algorithm, thereby enabling a fast approximation of the MBR decoding process. Our experimental results on machine translation tasks demonstrate that the proposed method requires 1/16 utility metric computations compared to vanilla MBR decoding while achieving equal translation quality measured by COMET22 on the WMT22 dataset (en<>de and en<>ru). We also benchmark our method against other approximation methods and we show gains in quality when comparing to them.
Self-Modifying State Modeling for Simultaneous Machine Translation
Yu, Donglei, Kang, Xiaomian, Liu, Yuchen, Zhou, Yu, Zong, Chengqing
Simultaneous Machine Translation (SiMT) generates target outputs while receiving stream source inputs and requires a read/write policy to decide whether to wait for the next source token or generate a new target token, whose decisions form a \textit{decision path}. Existing SiMT methods, which learn the policy by exploring various decision paths in training, face inherent limitations. These methods not only fail to precisely optimize the policy due to the inability to accurately assess the individual impact of each decision on SiMT performance, but also cannot sufficiently explore all potential paths because of their vast number. Besides, building decision paths requires unidirectional encoders to simulate streaming source inputs, which impairs the translation quality of SiMT models. To solve these issues, we propose \textbf{S}elf-\textbf{M}odifying \textbf{S}tate \textbf{M}odeling (SM$^2$), a novel training paradigm for SiMT task. Without building decision paths, SM$^2$ individually optimizes decisions at each state during training. To precisely optimize the policy, SM$^2$ introduces Self-Modifying process to independently assess and adjust decisions at each state. For sufficient exploration, SM$^2$ proposes Prefix Sampling to efficiently traverse all potential states. Moreover, SM$^2$ ensures compatibility with bidirectional encoders, thus achieving higher translation quality. Experiments show that SM$^2$ outperforms strong baselines. Furthermore, SM$^2$ allows offline machine translation models to acquire SiMT ability with fine-tuning.
MultiMax: Sparse and Multi-Modal Attention Learning
Zhou, Yuxuan, Fritz, Mario, Keuper, Margret
SoftMax is a ubiquitous ingredient of modern machine learning algorithms. It maps an input vector onto a probability simplex and reweights the input by concentrating the probability mass at large entries. Yet, as a smooth approximation to the Argmax function, a significant amount of probability mass is distributed to other, residual entries, leading to poor interpretability and noise. Although sparsity can be achieved by a family of SoftMax variants, they often require an alternative loss function and do not preserve multi-modality. We show that this trade-off between multi-modality and sparsity limits the expressivity of SoftMax as well as its variants. We provide a solution to this tension between objectives by proposing a piece-wise differentiable function, termed MultiMax, which adaptively modulates the output distribution according to input entry range. Through comprehensive analysis and evaluation, we show that MultiMax successfully produces a distribution that supresses irrelevant entries while preserving multimodality, with benefits in image classification, language modeling and machine translation. The code is available at https://github.com/ZhouYuxuanYX/MultiMax.
Deterministic Reversible Data Augmentation for Neural Machine Translation
Yao, Jiashu, Huang, Heyan, Liu, Zeming, Guo, Yuhang
Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate both symbolically diverse and semantically consistent augmentation data, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness in noisy, low-resource, and cross-domain datasets.