Machine Translation
Export Reviews, Discussions, Author Feedback and Meta-Reviews
Recent work on neural machine translation and other text generation tasks has trained models directly to minimize perplexity/negative log-likelihood of observed sequences. While this has shown very promising results, the setup ignores the fact that in practice the model is conditioning on generated symbols as opposed to gold symbols, and may therefore be conditioning on contexts that are quite different from the contexts seen in the gold data. This paper attempts to remedy this problem by utilizing generated sequences at training time: instead of conditioning on the gold context, it utilizes the generated context. Unfortunately, at early rounds of the algorithm this produces junk, so the authors introduce a "scheduled sampling" approach that alternates between the two training methods based on a predefined decay schedule inspired by curriculum learning.
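The alternation described in this review can be illustrated with a minimal sketch. The review does not say which decay schedule the paper uses, so the linear schedule below, along with the names `gold_probability`, `mix_context`, `decay_steps`, and `floor`, are assumptions made for illustration only.

```python
import random

def gold_probability(step, decay_steps=1000, floor=0.0):
    """Probability of conditioning on the gold token at this training step.

    Assumed linear decay from 1.0 down to `floor` over `decay_steps` steps.
    """
    return max(floor, 1.0 - step / decay_steps)

def mix_context(gold, generated, step, rng, decay_steps=1000, floor=0.0):
    """Pick, position by position, the gold token or the model's own token.

    `gold` and `generated` are equal-length token lists; `rng` is a
    random.Random instance so the choice is reproducible.
    """
    eps = gold_probability(step, decay_steps, floor)
    return [g if rng.random() < eps else p for g, p in zip(gold, generated)]

if __name__ == "__main__":
    rng = random.Random(0)
    gold = ["the", "cat", "sat"]
    generated = ["a", "dog", "sat"]
    for step in (0, 500, 1000):
        print(step, mix_context(gold, generated, step, rng))
```

Early in training the context is almost entirely gold, which avoids conditioning on junk; as `step` grows, the context shifts toward the model's own outputs, which is closer to what it will see at inference time.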
Large Multimodal Models for Low-Resource Languages: A Survey
Lupascu, Marian, Rogoz, Ana-Cristina, Stupariu, Mihai Sorin, Ionescu, Radu Tudor
In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 106 studies across 75 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. We aim to provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.
ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data
Liu, Xiaoyang, Bao, Kangjie, Zhang, Jiashuo, Liu, Yunqi, Chen, Yu, Liu, Yuntian, Jiao, Yang, Luo, Tao
Autoformalization, the process of automatically translating natural language mathematics into machine-verifiable formal language, has demonstrated advancements with the progress of large language models (LLMs). However, a key obstacle to further advancements is the scarcity of paired datasets that align natural language with formal language. To address this challenge, we introduce ATLAS (Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data), an iterative data generation framework designed to produce large-scale, high-quality parallel theorem statements. With the proposed ATLAS running for 10 iterations, we construct an undergraduate-level dataset comprising 300k theorem statements and develop the ATLAS translator, achieving accuracies of 80.59% (pass@8) and 92.99% (pass@128) on ProofNet, significantly outperforming the base model (23.99% and 47.17%) and InternLM2-Math-Plus-7B (50.94% and 80.32%). Furthermore, the ATLAS translator also achieves state-of-the-art performance on both the high-school-level miniF2F dataset and the graduate-level MathQual dataset introduced in this work. The datasets, model, and code will be released to the public soon.
Review for NeurIPS paper: Unsupervised Translation of Programming Languages
Reviewers agree that this paper is a significant advance in the problem of language translation. One lingering concern is with the positioning of the paper. In particular, the introduction needs to do a better job in recognizing that this paper focuses on small self-contained units of code. In order to be useful in a software engineering context, a translation tool would have to address a number of problems that are not addressed by this work, such as major differences in the design patterns used by APIs in different languages. Without a proper acknowledgment of the limitations of the approach early in the paper, this paper could make it difficult to publish follow-up work.
Review for NeurIPS paper: Estimating Training Data Influence by Tracing Gradient Descent
Weaknesses: I have some major concerns with the evaluation part of the paper. A simple baseline could be a loss-based selection method: simply select training points based on loss change. A recent paper [DataLens IJCNN 20] shows that simple loss-based selection outperforms both influence functions and representer selection at identifying mislabeled data when the fraction of mislabeled data is small. As the fraction of mislabeled data increases, influence functions work better than the loss-based method.
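One possible reading of the loss-based baseline the reviewer suggests is sketched below. The review does not define "loss change", so this sketch assumes it means the reduction in per-example loss between an early and a late checkpoint, and that examples whose loss barely drops are the most suspicious. The function name `rank_by_loss_change` and the input lists are hypothetical.

```python
def rank_by_loss_change(loss_start, loss_end):
    """Return example indices ordered from most to least suspicious.

    Assumption: an example whose loss decreased the least during training
    is the most likely to be mislabeled.
    """
    reductions = [s - e for s, e in zip(loss_start, loss_end)]
    return sorted(range(len(reductions)), key=lambda i: reductions[i])

if __name__ == "__main__":
    # Hypothetical per-example losses at an early and a late checkpoint.
    loss_start = [2.0, 2.0, 2.0]
    loss_end = [0.1, 1.9, 0.5]
    print(rank_by_loss_change(loss_start, loss_end))
```

A baseline of this kind needs only per-example losses, so it is far cheaper than gradient-based influence estimates, which is what makes it a natural point of comparison.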
Meta and UNESCO team up to improve translation AI
Meta has partnered with UNESCO on a new plan to improve translation and speech recognition AI, TechCrunch reported. As part of its Language Technology Partner Program, Meta is seeking collaborators willing to donate at least 10 hours of speech recordings with transcriptions, large written texts (200-plus sentences) and sets of translated sentences. The aim is to focus on "underserved languages, in support of UNESCO's work," Meta wrote in a blog post. So far, Meta and UNESCO have signed on the government of Nunavut, a northern Canadian territory. The goal there is to develop translation systems for the Inuit languages used in the territory, Inuktitut and Inuinnaqtun.
Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study
Cui, Menglong, Gao, Pengzhi, Liu, Wei, Luan, Jian, Wang, Bin
Large language models (LLMs) have shown continuously improving multilingual capabilities, and even small-scale open-source models have demonstrated rapid performance enhancement. In this paper, we systematically explore the abilities of open LLMs with less than ten billion parameters to handle multilingual machine translation (MT) tasks. We conduct comprehensive evaluations on six popular LLMs and find that models like Gemma2-9B exhibit impressive multilingual translation capabilities. We then introduce the Parallel-First Monolingual-Second (PFMS) data mixing strategy in the continual pretraining stage to further enhance the MT performance and present GemmaX2-28, a 9B model achieving top-tier multilingual translation performance across 28 languages. Specifically, GemmaX2-28 consistently outperforms the state-of-the-art (SOTA) models such as TowerInstruct and XALMA and achieves competitive performance with Google Translate and GPT-4-turbo.
Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources
Ticona, Belu, Carranza, Fernando, Cotik, Viviana
Argentina has a large yet little-known Indigenous linguistic diversity, encompassing at least 40 different languages. The majority of these languages are at risk of disappearing, resulting in a significant loss of world heritage and cultural knowledge. Currently, unified information on speakers and computational tools is lacking for these languages. In this work, we present a systematization of the Indigenous languages spoken in Argentina, classifying them into seven language families: Mapuche, Tupí-Guaraní, Guaycurú, Quechua, Mataco-Mataguaya, Aymara, and Chon. For each one, we present an estimation of the national Indigenous population size, based on the most recent Argentinian census. We discuss potential reasons why the census questionnaire design may underestimate the actual number of speakers. We also provide a concise survey of computational resources available for these languages, whether or not they were specifically developed for Argentinian varieties.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
The paper tackles (constituent) syntactic parsing by mapping this prediction problem to a sequence-to-sequence alignment problem, and then essentially applying a method recently developed in the context of neural machine translation (LSTM encoder-decoder with an attention mechanism). The resulting parsing model achieves state-of-the-art results when used in the standard supervised set-up (PTB WSJ) and improves further when estimated in a semi-supervised / co-training regime. What I find especially interesting in this paper is that the attention mechanism is crucial for attaining good generalization properties: without the attention mechanism, the LSTM achieves very poor results in the supervised setting. This is an interesting observation which may in principle generate future work focusing on refining the attention model (e.g., moving more in the direction of the Neural Turing Machines of Graves et al.). It is also somewhat surprising that such a simple linearization strategy led to state-of-the-art performance.
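To make the idea of a linearization strategy concrete, here is a minimal sketch of a depth-first bracket linearization that turns a constituency tree into a flat token sequence a sequence-to-sequence model could emit. The review does not spell out the paper's exact output format, so the labelled-bracket tokens and the tuple-based tree representation below are assumptions for illustration.

```python
def linearize(tree):
    """Flatten a tree into tokens by depth-first traversal.

    A tree is either a string (a word) or a tuple (label, child, child, ...).
    Each constituent opens with "(LABEL" and closes with ")LABEL".
    """
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)
    return tokens

if __name__ == "__main__":
    tree = ("S", ("NP", "John"), ("VP", ("V", "sleeps")))
    print(" ".join(linearize(tree)))
    # prints: (S (NP John )NP (VP (V sleeps )V )VP )S
```

Because the output is an ordinary token sequence, the same encoder-decoder machinery used for translation can be trained on sentence/linearized-tree pairs without any parser-specific architecture.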
Multilingual Non-Autoregressive Machine Translation without Knowledge Distillation
Huang, Chenyang, Huang, Fei, Zheng, Zaixiang, Zaïane, Osmar R., Zhou, Hao, Mou, Lili
Multilingual neural machine translation (MNMT) aims at using one single model for multiple translation directions. Recent work applies non-autoregressive Transformers to improve the efficiency of MNMT, but requires expensive knowledge distillation (KD) processes. To this end, we propose an M-DAT approach to non-autoregressive multilingual machine translation. Our system leverages the recent advance of the directed acyclic Transformer (DAT), which does not require KD. We further propose a pivot back-translation (PivotBT) approach to improve the generalization to unseen translation directions. Experiments show that our M-DAT achieves state-of-the-art performance in non-autoregressive MNMT.