AITopics

2403.14454

Country:

Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
Africa > Middle East > Egypt > Giza Governorate > Giza (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Doshi, Meet, Dabre, Raj, Bhattacharyya, Pushpak

Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

arXiv.org Artificial IntelligenceMar-21-2024

In this paper, we explore the utility of Translationese as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been a growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic). We show that their performance on downstream natural language understanding and generative tasks is only 3.56% poorer on NLU tasks and 1.51% on NLG tasks than LMs pre-trained on clean data. Further, we propose the use of lightweight TinyLMs pre-trained on clean data to filter synthetic data efficiently which significantly improves the performance of our models. We also find that LMs trained on synthetic data strongly benefit from extended pretraining on a tiny fraction (10%) of clean data. We release the data we collected and created as a part of this work, IndicMonoDoc, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for large language models.

computational linguistic, language model, translation, (16 more...)

2403.13638

Country:

Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > Singapore (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(18 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Mhaskar, Shivam Ratnakant, Shah, Nirmesh J., Zaki, Mohammadi, Gudmalwar, Ashishkumar P., Wasnik, Pankaj, Shah, Rajiv Ratn

Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning

arXiv.org Artificial IntelligenceMar-20-2024

Traditional Automatic Video Dubbing (AVD) pipeline consists of three key modules, namely, Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS). Within AVD pipelines, isometric-NMT algorithms are employed to regulate the length of the synthesized output text. This is done to guarantee synchronization with respect to the alignment of video and audio subsequent to the dubbing process. Previous approaches have focused on aligning the number of characters and words in the source and target language texts of Machine Translation models. However, our approach aims to align the number of phonemes instead, as they are closely associated with speech duration. In this paper, we present the development of an isometric NMT system using Reinforcement Learning (RL), with a focus on optimizing the alignment of phoneme counts in the source and target language sentence pairs. To evaluate our models, we propose the Phoneme Count Compliance (PCC) score, which is a measure of length compliance. Our approach demonstrates a substantial improvement of approximately 36% in the PCC score compared to the state-of-the-art models when applied to English-Hindi language pairs. Moreover, we propose a student-teacher architecture within the framework of our RL approach to maintain a trade-off between the phoneme count and translation quality.

architecture, language sentence, translation, (13 more...)

2403.15469

Country:

North America > United States (0.14)
Asia > India > Karnataka > Bengaluru (0.04)
North America > Canada > Ontario > Toronto (0.04)
(3 more...)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

arXiv.org Artificial IntelligenceMar-20-2024

A New Massive Multilingual Dataset for High-Performance Language Technologies

de Gibert, Ona, Nail, Graeme, Arefyev, Nikolay, Bañón, Marta, van der Linde, Jelmer, Ji, Shaoxiong, Zaragoza-Bernabeu, Jaume, Aulamo, Mikko, Ramírez-Sánchez, Gema, Kutuzov, Andrey, Pyysalo, Sampo, Oepen, Stephan, Tiedemann, Jörg

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

corpora, dataset, language pair, (14 more...)

2403.14009

Country:

Europe > Finland > Uusimaa > Helsinki (0.04)
North America > United States > Oregon (0.04)
Europe > Spain (0.04)
(7 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)

Zhao, Zhixue, Aletras, Nikolaos

Comparing Explanation Faithfulness between Multilingual and Monolingual Fine-tuned Language Models

In many real natural language processing application scenarios, practitioners not only aim to maximize predictive performance but also seek faithful explanations for the model predictions. Rationales and importance distribution given by feature attribution methods (FAs) provide insights into how different parts of the input contribute to a prediction. Previous studies have explored how different factors affect faithfulness, mainly in the context of monolingual English models. On the other hand, the differences in FA faithfulness between multilingual and monolingual models have yet to be explored. Our extensive experiments, covering five languages and five popular FAs, show that FA faithfulness varies between multilingual and monolingual models. We find that the larger the multilingual model, the less faithful the FAs are compared to its counterpart monolingual models.Our further analysis shows that the faithfulness disparity is potentially driven by the differences between model tokenizers. Our code is available: https://github.com/casszhao/multilingual-faith.

computational linguistic, faithfulness, multilingual model, (11 more...)

2403.12809

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(16 more...)

Genre: Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Park, Dojun, Padó, Sebastian

Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean

Almost all frameworks for the manual or automatic evaluation of machine translation characterize the quality of an MT output with a single number. An exception is the Multidimensional Quality Metrics (MQM) framework which offers a fine-grained ontology of quality dimensions for scoring (such as style, fluency, accuracy, and terminology). Previous studies have demonstrated the feasibility of MQM annotation but there are, to our knowledge, no computational models that predict MQM scores for novel texts, due to a lack of resources. In this paper, we address these shortcomings by (a) providing a 1200-sentence MQM evaluation benchmark for the language pair English-Korean and (b) reframing MT evaluation as the multi-task problem of simultaneously predicting several MQM scores using SOTA language models, both in a reference-based MT evaluation setup and a reference-free quality estimation (QE) setup. We find that reference-free setup outperforms its counterpart in the style dimension while reference-based models retain an edge regarding accuracy. Overall, RemBERT emerges as the most promising model. Through our evaluation, we offer an insight into the translation quality in a more fine-grained, interpretable manner.

dimension, evaluation, translation, (12 more...)

2403.12666

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
(17 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Parnell, Jacob, Unanue, Inigo Jauregi, Piccardi, Massimo

SumTra: A Differentiable Pipeline for Few-Shot Cross-Lingual Summarization

Cross-lingual summarization (XLS) generates summaries in a language different from that of the input documents (e.g., English to Spanish), allowing speakers of the target language to gain a concise view of their content. In the present day, the predominant approach to this task is to take a performing, pretrained multilingual language model (LM) and fine-tune it for XLS on the language pairs of interest. However, the scarcity of fine-tuning samples makes this approach challenging in some cases. For this reason, in this paper we propose revisiting the summarize-and-translate pipeline, where the summarization and translation tasks are performed in a sequence. This approach allows reusing the many, publicly-available resources for monolingual summarization and translation, obtaining a very competitive zero-shot performance. In addition, the proposed pipeline is completely differentiable end-to-end, allowing it to take advantage of few-shot fine-tuning, where available. Experiments over two contemporary and widely adopted XLS datasets (CrossSum and WikiLingua) have shown the remarkable zero-shot performance of the proposed approach, and also its strong few-shot performance compared to an equivalent multilingual LM baseline, that the proposed approach has been able to outperform in many languages with only 10% of the fine-tuning samples.

computational linguistic, fine-tuning, summarization, (14 more...)

2403.1324

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(13 more...)

Genre: Research Report (0.64)

Industry:

Law (1.00)
Information Technology (0.67)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Resta, Michele, Bacciu, Davide

Self-generated Replay Memories for Continual Neural Machine Translation

Modern Neural Machine Translation systems exhibit strong performance in several different languages and are constantly improving. Their ability to learn continuously is, however, still severely limited by the catastrophic forgetting issue. In this work, we leverage a key property of encoder-decoder Transformers, i.e. their generative ability, to propose a novel approach to continually learning Neural Machine Translation systems. We show how this can effectively learn on a stream of experiences comprising different languages, by leveraging a replay memory populated by using the model itself as a generator of parallel sentences. We empirically demonstrate that our approach can counteract catastrophic forgetting without requiring explicit memorization of training data. Code will be publicly available upon publication. Code: https://github.com/m-resta/sg-rep

computational linguistic, proceedings, translation, (14 more...)

2403.1313

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(14 more...)

Genre: Research Report (1.00)

Industry: Education (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial IntelligenceMar-18-2024

What Are Tools Anyway? A Survey from the Language Model Perspective

Wang, Zhiruo, Cheng, Zhoujun, Zhu, Hao, Fried, Daniel, Neubig, Graham

Language models (LMs) are powerful yet mostly for text generation tasks. Tools have substantially enhanced their performance for tasks that require complex skills. However, many works adopt the term "tool" in different ways, raising the question: What is a tool anyway? Subsequently, where and how do tools help LMs? In this survey, we provide a unified definition of tools as external programs used by LMs, and perform a systematic review of LM tooling scenarios and approaches. Grounded on this review, we empirically study the efficiency of various tooling methods by measuring their required compute and performance gains on various benchmarks, and highlight some challenges and potential future research in the field.

arxiv preprint arxiv, dataset, language model, (14 more...)

2403.15452

Country:

North America > United States > District of Columbia > Washington (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Asia > Singapore (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Overview (1.00)

Industry:

Education (0.67)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

arXiv.org Artificial IntelligenceMar-18-2024

Adaptative Bilingual Aligning Using Multilingual Sentence Embedding

Kraif, Olivier

In this paper, we present an adaptive bitextual alignment system called AIlign. This aligner relies on sentence embeddings to extract reliable anchor points that can guide the alignment path, even for texts whose parallelism is fragmentary and not strictly monotonic. In an experiment on several datasets, we show that AIlign achieves results equivalent to the state of the art, with quasi-linear complexity. In addition, AIlign is able to handle texts whose parallelism and monotonicity properties are only satisfied locally, unlike recent systems such as Vecalign or Bertalign.

alignment, anchor point, computational linguistic, (13 more...)

2403.11921

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.05)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Ohio > Franklin County > Columbus (0.04)
(11 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.50)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)