AITopics

2407.00436

Country:

Europe > Portugal > Lisbon > Lisbon (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(10 more...)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial IntelligenceJun-29-2024

Self-Translate-Train: A Simple but Strong Baseline for Cross-lingual Transfer of Large Language Models

Ri, Ryokan, Kiyono, Shun, Takase, Sho

Cross-lingual transfer is a promising technique for utilizing data in a source language to improve performance in a target language. However, current techniques often require an external translation system or suffer from suboptimal performance due to over-reliance on cross-lingual generalization of multi-lingual pretrained language models. In this study, we propose a simple yet effective method called Self-Translate-Train. It leverages the translation capability of a large language model to generate synthetic training data in the target language and fine-tunes the model with its own generated data. We evaluate the proposed method on a wide range of tasks and show substantial performance gains across several non-English languages.

computational linguistic, target language, translation, (15 more...)

2407.00454

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > Texas (0.04)
(11 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Tan, Xiaoqing Ellen, Hansanti, Prangthip, Wood, Carleigh, Yu, Bokai, Ropers, Christophe, Costa-jussà, Marta R.

Towards Massive Multilingual Holistic Bias

arXiv.org Artificial IntelligenceJun-29-2024

In the current landscape of automatic language generation, there is a need to understand, evaluate, and mitigate demographic biases as existing models are becoming increasingly multilingual. To address this, we present the initial eight languages from the MASSIVE MULTILINGUAL HOLISTICBIAS (MMHB) dataset and benchmark consisting of approximately 6 million sentences representing 13 demographic axes. We propose an automatic construction methodology to further scale up MMHB sentences in terms of both language coverage and size, leveraging limited human annotation. Our approach utilizes placeholders in multilingual sentence construction and employs a systematic method to independently translate sentence patterns, nouns, and descriptors. Combined with human translation, this technique carefully designs placeholders to dynamically generate multiple sentence variations and significantly reduces the human translation workload. The translation process has been meticulously conducted to avoid an English-centric perspective and include all necessary morphological variations for languages that require them, improving from the original English HOLISTICBIAS. Finally, we utilize MMHB to report results on gender bias and added toxicity in machine translation tasks. On the gender analysis, MMHB unveils: (1) a lack of gender robustness showing almost +4 chrf points in average for masculine semantic sentences compared to feminine ones and (2) a preference to overgeneralize to masculine forms by reporting more than +12 chrf points in average when evaluating with masculine compared to feminine references. MMHB triggers added toxicity up to 2.3%.

descriptor, linguist, translation, (15 more...)

2407.00486

Country:

Asia > Singapore (0.04)
Asia > China (0.04)
North America > United States > Alaska (0.04)
(15 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceJun-28-2024

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

Puvvada, Krishna C., Żelasko, Piotr, Huang, He, Hrinchuk, Oleksii, Koluguri, Nithin Rao, Dhawan, Kunal, Majumdar, Somshubra, Rastorgueva, Elena, Chen, Zhehuai, Lavrukhin, Vitaly, Balam, Jagadeesh, Ginsburg, Boris

It was observed in [6] that such long utterances harm the model convergence. We also note that this Recent advances in speech recognition and translation rely on approach may lead to significant padding in mini-batches, resulting hundreds of thousands of hours of Internet speech data. We argue in wasted computation on non-informative frames. We that state-of-the art accuracy can be reached without relying on present an alternative approach to sampling and batching that web-scale data. Canary - multilingual ASR and speech translation allows us to iterate through data twice as fast, while balancing model, outperforms current state-of-the-art models - Whisper, different languages and data sources better. We further accelerate OWSM, and Seamless-M4T on English, French, Spanish, and the training and inference by adopting a FastConformer [7] architecture German languages, while being trained on an order of magnitude and initializing the encoder from a ASR only pretrained less data than these models. Three key factors enables such dataefficient checkpoint.

canary, dataset, wer, (12 more...)

2406.19674

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > Santa Clara County > Santa Clara (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)

EngadgetJun-27-2024, 12:30:09 GMT

Google uses AI to add 110 new languages to Translate

While Google Translate is far from perfect, it's still a helpful way to gain information or engage in conversation. Now, that option is expanding as Google uses AI to give Translate 110 new languages, such as Cantonese, Punjabi (Shahmukhi) and NKo. About a quarter of the languages come from Africa and Google claims that all the new ones together represent 614 million speakers -- about eight percent of the global population. Google credits its LLM, PaLM 2, as "a key piece to the puzzle, helping Translate more efficiently learn languages that are closely related to each other, including languages close to Hindi, like Awadhi and Marwadi, and French creoles like Seychellois Creole and Mauritian Creole." Isaac Caswell, Senior Software Engineer, Google Translate, adds: "As technology advances, and as we continue to partner with expert linguists and native speakers, we'll support even more language varieties and spelling conventions over time."

google translate, google use ai, new language

Engadget

Country: Africa (0.29)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.72)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

Vincent, Sebastian, Prescott, Charlotte, Bayliss, Chris, Oakley, Chris, Scarton, Carolina

A Case Study on Contextual Machine Translation in a Professional Scenario of Subtitling

Incorporating extra-textual context such as film metadata into the machine translation (MT) pipeline can enhance translation quality, as indicated by automatic evaluation in recent work. However, the positive impact of such systems in industry remains unproven. We report on an industrial case study carried out to investigate the benefit of MT in a professional scenario of translating TV subtitles with a focus on how leveraging extra-textual context impacts post-editing. We found that post-editors marked significantly fewer context-related errors when correcting the outputs of MTCue, the context-aware model, as opposed to non-contextual models. We also present the results of a survey of the employed post-editors, which highlights contextual inadequacy as a significant gap consistently observed in MT. Our findings strengthen the motivation for further work within fully contextual MT.

mtc ue, pes, translation, (14 more...)

2407.00108

Country:

Europe > Belgium > Flanders > East Flanders > Ghent (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > Canada > Ontario > Toronto (0.04)
(5 more...)

Genre: Research Report > New Finding (0.88)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

xTower: A Multilingual LLM for Explaining and Correcting Translation Errors

Treviso, Marcos, Guerreiro, Nuno M., Agrawal, Sweta, Rei, Ricardo, Pombal, José, Vaz, Tania, Wu, Helena, Silva, Beatriz, van Stigt, Daan, Martins, André F. T.

While machine translation (MT) systems are achieving increasingly strong performance on benchmarks, they often produce translations with errors and anomalies. Understanding these errors can potentially help improve the translation quality and user experience. This paper introduces xTower, an open large language model (LLM) built on top of TowerBase designed to provide free-text explanations for translation errors in order to guide the generation of a corrected translation. The quality of the generated explanations by xTower are assessed via both intrinsic and extrinsic evaluation. We ask expert translators to evaluate the quality of the explanations across two dimensions: relatedness towards the error span being explained and helpfulness in error understanding and improving translation quality. Extrinsically, we test xTower across various experimental setups in generating translation corrections, demonstrating significant improvements in translation quality. Our findings highlight xTower's potential towards not only producing plausible and helpful explanations of automatic translations, but also leveraging them to suggest corrected translations.

error span, explanation, translation, (15 more...)

2406.19482

Country:

Europe > Portugal > Lisbon > Lisbon (0.14)
Asia > Singapore (0.05)
North America > Canada > Ontario > Toronto (0.04)
(9 more...)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Voices Unheard: NLP Resources and Models for Yor\`ub\'a Regional Dialects

Ahia, Orevaoghene, Aremu, Anuoluwapo, Abagyan, Diana, Gonen, Hila, Adelani, David Ifeoluwa, Abolade, Daud, Smith, Noah A., Tsvetkov, Yulia

Yor\`ub\'a an African language with roughly 47 million speakers encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus YOR\`ULECT across three domains and four regional Yor\`ub\'a dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yor\`ub\'a and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yor\`ub\'a and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release YOR\`ULECT dataset and models publicly under an open license.

computational linguistic, dialect, translation, (15 more...)

2406.19564

Country:

Asia > Singapore (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(29 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation

Moslem, Yasmin

This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2024) for Irish-to-English speech translation. We built end-to-end systems based on Whisper, and employed a number of data augmentation techniques, such as speech back-translation and noise augmentation. We investigate the effect of using synthetic audio data and discuss several methods for enriching signal diversity.

dataset, proceedings, translation, (12 more...)

2406.17363

Country:

Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > Massachusetts (0.04)
(12 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Sparse Regression for Machine Translation

Biçici, Ergun

We use transductive regression techniques to learn mappings between source and target features of given parallel corpora and use these mappings to generate machine translation outputs. We show the effectiveness of $L_1$ regularized regression (\textit{lasso}) to learn the mappings between sparsely observed feature sets versus $L_2$ regularized regression. Proper selection of training instances plays an important role to learn correct feature mappings within limited computational resources and at expected accuracy levels. We introduce \textit{dice} instance selection method for proper selection of training instances, which plays an important role to learn correct feature mappings for improving the source and target coverage of the training set. We show that $L_1$ regularized regression performs better than $L_2$ regularized regression both in regression measurements and in the translation experiments using graph decoding. We present encouraging results when translating from German to English and Spanish to English. We also demonstrate results when the phrase table of a phrase-based decoder is replaced with the mappings we find with the regression model.

moses, phrase table, regression, (15 more...)

2406.19478

Country:

North America > United States > Ohio > Franklin County > Columbus (0.04)
North America > United States > New York > Monroe County > Rochester (0.04)
North America > United States > Michigan (0.04)
(5 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)