Machine Translation
G-DIG: Towards Gradient-based Diverse and High-quality Instruction Data Selection for Machine Translation
Pan, Xingyuan, Huang, Luyang, Kang, Liyan, Liu, Zhicheng, Lu, Yu, Cheng, Shanbo
Large Language Models (LLMs) have demonstrated remarkable abilities in general scenarios. Instruction finetuning empowers them to align with humans in various tasks. Nevertheless, the diversity and quality of the instruction data remain two main challenges for instruction finetuning. To address them, we propose a novel gradient-based method to automatically select high-quality and diverse instruction finetuning data for machine translation. Our key innovation centers around analyzing how individual training examples influence the model during training. Specifically, we select training examples that exert beneficial influences on the model as high-quality ones, using influence functions and a small high-quality seed dataset. Moreover, to enhance the diversity of the training data, we maximize the variety of influences they have on the model by clustering on their gradients and resampling. Extensive experiments on WMT22 and FLORES translation tasks demonstrate the superiority of our methods, and in-depth analysis further validates their effectiveness and generalization.
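The two selection steps described in this abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the influence score here is a plain dot product between a training example's gradient and the seed-set gradient (that is, the inverse Hessian of the influence function is replaced by the identity), the clustering is a small k-means with farthest-point initialization, and "resampling" is reduced to taking the top-scoring examples from each cluster. All function names and the toy gradients are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def influence_scores(train_grads, seed_grad):
    # Assumption: identity matrix in place of the inverse Hessian, so the
    # score reduces to gradient similarity with the seed set.
    return [dot(g, seed_grad) for g in train_grads]

def kmeans(points, k, iters=10):
    # Deterministic farthest-point initialization, then standard Lloyd updates.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def select(train_grads, seed_grad, k, per_cluster):
    scores = influence_scores(train_grads, seed_grad)
    # Quality filter: keep examples whose approximate influence is positive.
    keep = [i for i, s in enumerate(scores) if s > 0]
    if not keep:
        return []
    k = min(k, len(keep))
    labels = kmeans([train_grads[i] for i in keep], k)
    chosen = []
    # Diversity: draw the same number of examples from every gradient cluster.
    for c in range(k):
        members = [i for i, lab in zip(keep, labels) if lab == c]
        members.sort(key=lambda i: scores[i], reverse=True)
        chosen.extend(members[:per_cluster])
    return sorted(chosen)

# Toy gradients: two examples pointing along each axis, one harmful example.
train_grads = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9], [-1.0, -1.0]]
seed_grad = [1.0, 1.0]
print(select(train_grads, seed_grad, k=2, per_cluster=1))
```

In the toy data, the last example has a negative score and is filtered out, and one example is taken from each of the two gradient directions rather than two near-duplicates from the same direction.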
Identifying Intensity of the Structure and Content in Tweets and the Discriminative Power of Attributes in Context with Referential Translation Machines
We use referential translation machines (RTMs) to identify the similarity between an attribute and two words in English by casting the task as machine translation performance prediction (MTPP) between the words and the attribute word and the distance between their similarities for Task 10 with stacked RTM models. RTMs are also used to predict the intensity of the structure and content in tweets in English, Arabic, and Spanish in Task 1, where MTPP is between the tweets and the set of words selected for the emotion.
Figure 1: RTM depiction: parfda selects interpretants close to the data using corpora; two MTPPs use interpretants, training data, and test data to generate features in the same space; learning and prediction use these features as input. Spheres are for feature spaces.
Recent Advancements and Challenges of Turkic Central Asian Language Processing
Research in the NLP sphere of the Turkic counterparts of Central Asian languages, namely Kazakh, Uzbek, Kyrgyz, and Turkmen, comes with the typical challenges of low-resource languages, like data scarcity and a general lack of linguistic resources. However, in recent years, research has greatly advanced via collection of language-specific datasets and development of downstream task technologies. Aiming to summarize this research up until May 2024, this paper also seeks to identify potential areas of future work. To achieve this, the paper gives a broad, high-level overview of the linguistic properties of the languages, the current coverage and performance of already developed technology, application of transfer learning techniques from higher-resource languages, and availability of labeled and unlabeled data for each language. By providing a summary of the current state of affairs, we hope that the considerations in this paper will facilitate further research.
Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus
Khiarak, Jalil Nourmohammadi, Ahmadi, Ammar, Saeed, Taher Akbari, Asgari-Chenaghlu, Meysam, Atabay, Toğrul, Karimi, Mohammad Reza Baghban, Ceferli, Ismail, Hasanvand, Farzad, Mousavi, Seyed Mahboub, Noshad, Morteza
This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus, designed to bridge the technological gap in language learning and machine translation (MT) for under-resourced languages. Consisting of 548,000 parallel sentences and approximately 9 million words per language, this dataset is derived from diverse sources such as news articles and holy texts, aiming to enhance natural language processing (NLP) applications and language education technology. This corpus marks a significant step forward in the realm of linguistic resources, particularly for Turkic languages, which have lagged in the neural machine translation (NMT) revolution. By presenting the first comprehensive case study for the English-Azerbaijani (Arabic Script) language pair, this work underscores the transformative potential of NMT in low-resource contexts. The development and utilization of this corpus not only facilitate the advancement of machine translation systems tailored for specific linguistic needs but also promote inclusive language learning through technology. The findings demonstrate the corpus's effectiveness in training deep learning MT systems and underscore its role as an essential asset for researchers and educators aiming to foster bilingual education and multilingual communication. This research paves the way for future explorations into NMT applications for languages lacking substantial digital resources, thereby enhancing global language education frameworks. The Python package of our code is available at https://pypi.org/project/chevir-kartalol/, and we also have a website accessible at https://translate.kartalol.com/.
The pitfalls of next-token prediction
Bachmann, Gregor, Nagarajan, Vaishnavh
Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern and correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner -- remarkably, despite the task being straightforward to learn. Finally, we provide preliminary evidence that this failure can be resolved using a simple modification that predicts multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available under https://github.com/gregorbachmann/Next-Token-Failures
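The distinction between the two phases named in this abstract can be illustrated with a toy sketch. It is not taken from the paper or its repository: the "model" is a hypothetical lookup table with one deliberately wrong entry, standing in for any imperfect next-token predictor. Under teacher forcing every prediction is conditioned on the ground-truth previous token, while in autoregressive inference each prediction is conditioned on the model's own earlier output, so a single mistake can propagate.

```python
# Hypothetical next-token "model": a lookup table over a 4-token vocabulary.
# The entry for "b" is deliberately wrong (the target sequence needs "c").
TABLE = {"a": "b", "b": "d", "c": "d", "d": "a"}

def model(prev_token):
    return TABLE[prev_token]

def teacher_forced_predictions(model, target):
    # Each step sees the ground-truth previous token, never its own output.
    return [model(prev) for prev in target[:-1]]

def autoregressive_rollout(model, start, steps):
    # Each step sees the token the model itself produced at the previous step.
    out = [start]
    for _ in range(steps):
        out.append(model(out[-1]))
    return out

target = ["a", "b", "c", "d"]

forced = teacher_forced_predictions(model, target)
rollout = autoregressive_rollout(model, target[0], len(target) - 1)

forced_correct = sum(p == t for p, t in zip(forced, target[1:]))
rollout_correct = sum(p == t for p, t in zip(rollout[1:], target[1:]))

print("teacher-forced:", forced, forced_correct, "of 3 correct")
print("autoregressive:", rollout[1:], rollout_correct, "of 3 correct")
```

With the same single faulty entry, the teacher-forced pass gets two of three next tokens right, whereas the rollout derails after the first mistake and gets only one right. The abstract's argument concerns a different failure, where teacher-forced training itself does not learn an accurate predictor; this sketch only shows why the two phases have to be measured separately.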
NADI 2024: The Fifth Nuanced Arabic Dialect Identification Shared Task
Abdul-Mageed, Muhammad, Keleg, Amr, Elmadany, AbdelRahim, Zhang, Chiyu, Hamed, Injy, Magdy, Walid, Bouamor, Houda, Habash, Nizar
We describe the findings of the fifth Nuanced Arabic Dialect Identification Shared Task (NADI 2024). NADI's objective is to help advance SoTA Arabic NLP by providing guidance, datasets, modeling opportunities, and standardized evaluation conditions that allow researchers to collaboratively compete on pre-specified tasks. NADI 2024 targeted dialect identification cast as a multi-label task (Subtask 1), identification of the Arabic level of dialectness (Subtask 2), and dialect-to-MSA machine translation (Subtask 3). A total of 51 unique teams registered for the shared task, of which 12 participated (with 76 valid submissions during the test phase). Among these, three teams participated in Subtask 1, three in Subtask 2, and eight in Subtask 3. The winning teams achieved 50.57 F1 on Subtask 1, 0.1403 RMSE on Subtask 2, and 20.44 BLEU on Subtask 3, respectively. Results show that Arabic dialect processing tasks such as dialect identification and machine translation remain challenging. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.
Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation
Sildam, Tiia, Velve, Andra, Alumäe, Tanel
This paper investigates the finetuning of end-to-end models for bidirectional Estonian-English and Estonian-Russian conversational speech-to-text translation. Due to the limited availability of speech translation data for Estonian, we created additional training data by web scraping and synthesizing data from speech recognition datasets using machine translation. We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. Our results indicate that fine-tuning with synthetic data enhances translation accuracy by a large margin, with SeamlessM4T matching or surpassing cascaded speech translation systems that use state-of-the-art speech recognition and machine translation models.
A Survey of Data Synthesis Approaches
Chang, Hsin-Yu, Chen, Pei-Yu, Chou, Tun-Hsiang, Kao, Chang-Sheng, Yu, Hsuan-Yun, Lin, Yen-Ting, Chen, Yun-Nung
This paper provides a detailed survey of synthetic data techniques. We first discuss the expected goals of using synthetic data in data augmentation, which can be divided into four parts: 1) Improving Diversity, 2) Data Balancing, 3) Addressing Domain Shift, and 4) Resolving Edge Cases. Because data synthesis is closely tied to the prevailing machine learning techniques of its time, we summarize the domain of synthetic data techniques into four categories: 1) Expert-knowledge, 2) Direct Training, 3) Pre-train then Fine-tune, and 4) Foundation Models without Fine-tuning. Next, we categorize the goals of synthetic data filtering into four types for discussion: 1) Basic Quality, 2) Label Consistency, and 3) Data Distribution. In section 5 of this paper, we also discuss the future directions of synthetic data and state three directions that we believe are important: 1) focus more on quality, 2) the evaluation of synthetic data, and 3) multi-model data augmentation.
Clear-screen translation system is being tested at Tokyo's Haneda
Japan Airlines and Toppan said Tuesday that they have begun a demonstration test at Tokyo's Haneda Airport of the clear-screen translation system developed by the printing company. The system can automatically translate spoken words into 13 languages, including English and Korean, and quickly display the translated words and sentences on its transparent screen. It also shows words entered with a keyboard. With its clear screen, the system enables speakers to talk while seeing each other's faces. It is designed to rapidly provide information to foreign travelers and people with hearing difficulties. The trial will be conducted at counters at Haneda Airport's Terminal 1 until Monday and at Osaka International Airport, also known as Itami Airport, in August.
Sentence-level Aggregation of Lexical Metrics Correlate Stronger with Human Judgements than Corpus-level Aggregation
Cavalin, Paulo, Domingues, Pedro Henrique, Pinhanez, Claudio
In this paper we show that corpus-level aggregation considerably hinders the capability of lexical metrics to accurately evaluate machine translation (MT) systems. With empirical experiments we demonstrate that averaging individual segment-level scores can make metrics such as BLEU and chrF correlate much more strongly with human judgements and make them behave considerably more like neural metrics such as COMET and BLEURT. We show that this difference exists because corpus- and segment-level aggregation differ considerably, owing to the classical mathematical problem of the average of ratios versus the ratio of averages. Moreover, as we also show, this difference considerably affects the statistical robustness of corpus-level aggregation. Considering that neural metrics currently only cover a small set of sufficiently-resourced languages, the results in this paper can help make the evaluation of MT systems for low-resource languages more trustworthy.
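The "average of ratios versus ratio of averages" effect mentioned in this abstract can be seen in a small worked example. The sketch below does not use BLEU or chrF: as a stand-in it uses clipped unigram precision, a deliberately simplified lexical metric chosen only for illustration, and the function names and the two sentence pairs are hypothetical. Segment-level aggregation averages one ratio per segment, whereas corpus-level aggregation sums the match counts and lengths over all segments first and divides once, so long segments dominate.

```python
from collections import Counter

def match_counts(hypothesis, reference):
    # Clipped unigram matches and hypothesis length for one segment.
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    matched = sum(min(n, ref_counts[tok]) for tok, n in hyp_counts.items())
    return matched, sum(hyp_counts.values())

def segment_level_score(pairs):
    # Average of ratios: every segment contributes equally.
    ratios = []
    for hyp, ref in pairs:
        matched, length = match_counts(hyp, ref)
        if length > 0:
            ratios.append(matched / length)
    return sum(ratios) / len(ratios)

def corpus_level_score(pairs):
    # Ratio of sums: segments contribute in proportion to their length.
    counts = [match_counts(hyp, ref) for hyp, ref in pairs]
    return sum(m for m, _ in counts) / sum(n for _, n in counts)

pairs = [
    ("the cat", "the cat"),                  # 2 of 2 tokens match
    ("a b c d e f g h", "a b x y z w v u"),  # 2 of 8 tokens match
]

print(round(segment_level_score(pairs), 3))  # (1.0 + 0.25) / 2 = 0.625
print(round(corpus_level_score(pairs), 3))   # (2 + 2) / (2 + 8) = 0.4
```

The same two segments yield 0.625 under segment-level averaging and 0.4 under corpus-level aggregation, because the long, poorly translated segment carries four times the weight of the short one in the pooled counts.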