AITopics

2404.08259

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.05)
Asia > China > Hong Kong (0.04)
(19 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.88)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial Intelligence, Apr-12-2024

Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding

Yang, Guangyu, Chen, Jinghong, Lin, Weizhe, Byrne, Bill

Minimum Bayes Risk (MBR) decoding can significantly improve translation performance of Multilingual Large Language Models (MLLMs). However, MBR decoding is computationally expensive. We show how the recently developed Reinforcement Learning technique, Direct Preference Optimization (DPO), can fine-tune MLLMs to get the gains of MBR without any additional computation in inference. Our method uses only a small monolingual fine-tuning set and yields significantly improved performance on multiple NMT test sets compared to MLLMs without DPO.
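To make the cost argument in the abstract concrete, here is a minimal sketch of sampling-based MBR decoding. It is not the authors' implementation: the function names `unigram_f1` and `mbr_decode` are hypothetical, and the unigram F1 utility is a toy stand-in for whatever learned or reference-based metric a real system would use. Each candidate is scored by its average utility against the other candidates, treated as pseudo-references, so N candidates need N(N-1) utility calls, which is why MBR is expensive at inference time.

```python
from collections import Counter
from typing import Callable, List


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility (an assumption for illustration): unigram F1 overlap."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates: List[str],
               utility: Callable[[str, str], float] = unigram_f1) -> str:
    """Return the candidate with the highest average utility against
    all other candidates, which serve as pseudo-references."""
    if not candidates:
        raise ValueError("need at least one candidate")

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], ref) for ref in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran away",
]
print(mbr_decode(samples))
```

In this toy run the second sample wins because it overlaps with both of the others. In a real setting the candidates would be sampled from the translation model, and the quadratic number of utility calls is the overhead that the paper's fine-tuning approach aims to avoid.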

computational linguistic, fine-tuning, translation, (13 more...)

2311.0838

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.05)
North America > Canada > Ontario > Toronto (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Lou, Andrés, Pérez-Ortiz, Juan Antonio, Sánchez-Martínez, Felipe, Sánchez-Cartagena, Víctor M.

Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars

The Mayan languages comprise a language family with an ancient history, millions of speakers, and immense cultural value, that, nevertheless, remains severely underrepresented in terms of resources and global exposure. In this paper we develop, curate, and publicly release a set of corpora in several Mayan languages spoken in Guatemala and Southern Mexico, which we call MayanV. The datasets are parallel with Spanish, the dominant language of the region, and are taken from official native sources focused on representing informal, day-to-day, and non-domain-specific language. As such, and according to our dialectometric analysis, they differ in register from most other available resources. Additionally, we present neural machine translation models, trained on as many resources and Mayan languages as possible, and evaluated exclusively on our datasets. We observe lexical divergences between the dialects of Spanish in our resources and the more widespread written standard of Spanish, and that resources other than the ones we present do not seem to improve translation performance, indicating that many such resources may not accurately capture common, real-life language usage. The MayanV dataset is available at https://github.com/transducens/mayanv.

corpora, guatemala, mayan language, (13 more...)

2404.07673

Country:

North America > Mexico (0.34)
North America > Guatemala (0.28)
Europe > Spain (0.14)
(15 more...)

Genre: Research Report (0.64)

Industry:

Education (0.46)
Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Yang, Eugene, Lawrie, Dawn J., McNamee, Paul, Mayfield, James

Extending Translate-Train for ColBERT-X to African Language CLIR

This paper describes the submission runs from the HLTCOE team at the CIRAL CLIR tasks for African languages at FIRE 2023. Our submissions use machine translation models to translate the documents and the training passages, and ColBERT-X as the retrieval model. Additionally, we present a set of unofficial runs that use an alternative training procedure with a similar training setting.

language model, translate-train, translation, (16 more...)

2404.08134

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Maryland > Baltimore (0.04)
(6 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Dipta, Shubhashis Roy, Vallurupalli, Sai

UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation

The aim of SemEval-2024 Task 1, "Semantic Textual Relatedness for African and Asian Languages", is to develop models for identifying semantic textual relatedness (STR) between two sentences using multiple languages (14 African and Asian languages) and settings (supervised, unsupervised, and cross-lingual). Large language models (LLMs) have shown impressive performance on several natural language understanding tasks such as multilingual machine translation (MMT), semantic similarity (STS), and encoding sentence embeddings. Using a combination of LLMs that perform well on these tasks, we developed two STR models, $\textit{TranSem}$ and $\textit{FineSem}$, for the supervised and cross-lingual settings. We explore the effectiveness of several training methods and the usefulness of machine translation. We find that direct fine-tuning on the task is comparable to using sentence embeddings, and that translating to English leads to better performance for some languages. In the supervised setting, our model performance is better than the official baseline for 3 languages, with the remaining 4 performing on par. In the cross-lingual setting, our model performance is better than the baseline for 3 languages (leading to $1^{st}$ place for Afrikaans and $2^{nd}$ place for Indonesian), is on par for 2 languages, and performs poorly on the remaining 7 languages. Our code is publicly available at https://github.com/dipta007/SemEval24-Task8.

dataset, machine translation, relatedness, (12 more...)

2402.1273

Country:

North America > United States > Maryland > Baltimore County (0.14)
North America > United States > Maryland > Baltimore (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
(4 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Ki, Dayeon, Carpuat, Marine

Guiding Large Language Models to Post-Edit Machine Translation with Error Annotations

Machine Translation (MT) remains one of the last NLP tasks where large language models (LLMs) have not yet replaced dedicated supervised systems. This work exploits the complementary strengths of LLMs and supervised MT by guiding LLMs to automatically post-edit MT with external feedback on its quality, derived from Multidimensional Quality Metric (MQM) annotations. Working with LLaMA-2 models, we consider prompting strategies varying the nature of feedback provided and then fine-tune the LLM to improve its ability to exploit the provided guidance. Through experiments on Chinese-English, English-German, and English-Russian MQM data, we demonstrate that prompting LLMs to post-edit MT improves TER, BLEU and COMET scores, although the benefits of fine-grained feedback are not clear. Fine-tuning helps integrate fine-grained feedback more effectively and further improves translation quality based on both automatic and human evaluation.

annotation, language pair, translation, (15 more...)

2404.07851

Country:

Asia > Singapore (0.05)
North America > United States > Maryland (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(10 more...)

Genre: Research Report > New Finding (0.93)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence, Apr-10-2024

Control-DAG: Constrained Decoding for Non-Autoregressive Directed Acyclic T5 using Weighted Finite State Automata

Chen, Jinghong, Lin, Weizhe, Mei, Jingbiao, Byrne, Bill

The Directed Acyclic Transformer is a fast non-autoregressive (NAR) model that performs well in Neural Machine Translation. Two issues prevent its application to general Natural Language Generation (NLG) tasks: frequent Out-Of-Vocabulary (OOV) errors and the inability to faithfully generate entity names. We introduce Control-DAG, a constrained decoding algorithm for our Directed Acyclic T5 (DA-T5) model which offers lexical, vocabulary and length control. We show that Control-DAG significantly enhances DA-T5 on the Schema Guided Dialogue and the DART datasets, establishing strong NAR results for Task-Oriented Dialogue and Data-to-Text NLG.

computational linguistic, constraint, control-dag, (14 more...)

2404.06854

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(5 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.68)

Popel, Martin, Poláková, Lucie, Novák, Michal, Helcl, Jindřich, Libovický, Jindřich, Straňák, Pavel, Krabač, Tomáš, Hlaváčová, Jaroslava, Anisimova, Mariia, Chlaňová, Tereza

Charles Translator: A Machine Translation System between Ukrainian and Czech

arXiv.org Artificial Intelligence, Apr-10-2024

We present Charles Translator, a machine translation system between Ukrainian and Czech, developed as part of a society-wide effort to mitigate the impact of the Russian-Ukrainian war on individuals and society. The system was developed in the spring of 2022 with the help of many language data providers in order to quickly meet the demand for such a service, which was not available at the time in the required quality. The translator was later implemented as an online web interface and as an Android app with speech input, both featuring Cyrillic-Latin script transliteration. The system translates directly, unlike other available systems that use English as a pivot, and thus takes advantage of the typological similarity of the two languages. It uses the block back-translation method, which allows for efficient use of monolingual training data. The paper describes the development process, including data collection and implementation, presents an evaluation, mentions several use cases, and outlines possibilities for the further development of the system for educational purposes.

computational linguistic, proceedings, translation, (11 more...)

2404.06964

Country:

Europe > Ukraine (0.15)
Europe > United Kingdom (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.05)
(14 more...)

Genre: Research Report (0.40)

Industry:

Government (0.47)
Law (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Sanchez-Bayona, Elisa, Agerri, Rodrigo

Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation

arXiv.org Artificial Intelligence, Apr-10-2024

According to Lakoff and Johnson (1980), we can establish a distinction between conceptual metaphors, cognitive mappings that arise from the association between source and target domains, and linguistic metaphors, the expression of these mappings through language. The pervasiveness of metaphors in our daily speech makes it fundamental for language models to be able to process them accordingly, in order to achieve a satisfactory interaction between users and these tools. In addition, metaphor processing may have implications for other Natural Language Processing (NLP) tasks such as Machine Translation (Mao, Lin, and Guerin 2018; Schäffner 2004; Shutova, Teufel, and Korhonen 2013), political discourse analysis (Charteris-Black 2011; Prabhakaran, Rei, and Shutova 2021; Rodríguez et al. 2023) or hate speech detection (Lemmens, Markov, and Daelemans 2021), among others. Since in this work we study metaphor occurrence in natural language sentences, we will focus on linguistic metaphors only. The most explored task so far is metaphor detection or identification, approached as a sequence labeling task grounded on different theoretical proposals (Wilks 1975, 1978; Searle 1979; Black 1962). The most widely used methodology at present is that of the MIPVU guidelines (Steen et al. 2010), which rely on the mismatch between the basic and contextual meaning of a potential metaphor. The application of this procedure resulted in the publication of the reference dataset VUAM.

computational linguistic, dataset, metaphor, (12 more...)

2404.07053

Country:

Europe > Austria > Vienna (0.14)
North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(20 more...)

Genre:

Overview (0.68)
Research Report (0.63)

Industry: Government (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.66)

Yeshpanov, Rustem, Polonskaya, Alina, Varol, Huseyin Atakan

KazParC: Kazakh Parallel Corpus for Machine Translation

arXiv.org Artificial Intelligence, Apr-9-2024

We introduce KazParC, a parallel corpus designed for machine translation across Kazakh, English, Russian, and Turkish. The first and largest publicly available corpus of its kind, KazParC contains a collection of 371,902 parallel sentences covering different domains and developed with the assistance of human translators. Our research efforts also extend to the development of a neural machine translation model nicknamed Tilmash. Remarkably, the performance of Tilmash is on par with, and in certain instances surpasses, that of industry giants such as Google Translate and Yandex Translate, as measured by standard evaluation metrics such as BLEU and chrF. Both KazParC and Tilmash are openly available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.

language pair, machine translation, translation, (14 more...)

2403.19399

Country:

Europe > Italy > Tuscany > Florence (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(8 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Education (0.68)
Information Technology (0.66)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)