AITopics | Machine Translation

Collaborating Authors

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

News Overviews Instructional Materials AI-Alerts Classics

A Bayesian Optimization Approach to Machine Translation Reranking

Cheng, Julius, Züfle, Maike, Zouhar, Vilém, Vlachos, Andreas

arXiv.org Artificial IntelligenceNov-14-2024

Reranking a list of candidates from a machine translation system with an external scoring model and returning the highest-scoring candidate remains a simple and effective method for improving the overall output quality. Translation scoring models continue to grow in size, with the best models being comparable to generation models. Thus, reranking can add substantial computational cost to the translation pipeline. In this work, we pose reranking as a Bayesian optimization (BayesOpt) problem. By strategically selecting candidates to score based on a balance of exploration and exploitation, we show that it is possible to find top-scoring candidates when scoring only a fraction of the candidate list. For instance, our method achieves the same CometKiwi score using only 70 scoring evaluations compared a baseline system using 180. We present a multi-fidelity setting for BayesOpt, where the candidates are first scored with a cheaper but noisier proxy scoring model, which further improves the cost-performance tradeoff when using smaller but well-trained distilled proxy scorers.

artificial intelligence, computational linguistic, natural language, (16 more...)

arXiv.org Artificial Intelligence

2411.09694

Country:

Europe (0.28)
Asia > Thailand (0.14)

Genre: Research Report > Experimental Study (0.46)

Industry: Energy > Oil & Gas > Upstream (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Methods of Automatic Matrix Language Determination for Code-Switched Speech

Iakovenko, Olga, Hain, Thomas

arXiv.org Artificial IntelligenceNov-14-2024

Code-switching (CS) is the process of speakers interchanging between two or more languages which in the modern world becomes increasingly common. In order to better describe CS speech the Matrix Language Frame (MLF) theory introduces the concept of a Matrix Language, which is the language that provides the grammatical structure for a CS utterance. In this work the MLF theory was used to develop systems for Matrix Language Identity (MLID) determination. The MLID of English/Mandarin and English/Spanish CS text and speech was compared to acoustic language identity (LID), which is a typical way to identify a language in monolingual utterances. MLID predictors from audio show higher correlation with the textual principles than LID in all cases while also outperforming LID in an MLID recognition task based on F1 macro (60%) and correlation score (0.38). This novel approach has identified that non-English languages (Mandarin and Spanish) are preferred over the English language as the ML contrary to the monolingual choice of LID.

machine learning, natural language, utterance, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.48448/whxj-by83

2410.02521

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Africa > South Africa (0.04)
Africa > Nigeria (0.04)
(5 more...)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
Information Technology > Artificial Intelligence > Speech (0.68)

Add feedback

Continuous Rating as Reliable Human Evaluation of Simultaneous Speech Translation

Javorský, Dávid, Macháček, Dominik, Bojar, Ondřej

arXiv.org Artificial IntelligenceNov-14-2024

Simultaneous speech translation (SST) can be evaluated on simulated online events where human evaluators watch subtitled videos and continuously express their satisfaction by pressing buttons (so called Continuous Rating). Continuous Rating is easy to collect, but little is known about its reliability, or relation to comprehension of foreign language document by SST users. In this paper, we contrast Continuous Rating with factual questionnaires on judges with different levels of source language knowledge. Our results show that Continuous Rating is easy and reliable SST quality assessment if the judges have at least limited knowledge of the source language. Our study indicates users' preferences on subtitle layout and presentation style and, most importantly, provides a significant evidence that users with advanced source language knowledge prefer low latency over fewer re-translations.

continuous rating, subtitle, translation, (15 more...)

arXiv.org Artificial Intelligence

2203.02458

Country:

Europe > Slovenia (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(11 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Direct Speech-to-Speech Neural Machine Translation: A Survey

Gupta, Mahendra, Dutta, Maitreyee, Maurya, Chandresh Kumar

arXiv.org Artificial IntelligenceNov-13-2024

Speech-to-Speech Translation (S2ST) models transform speech from one language to another target language with the same linguistic information. S2ST is important for bridging the communication gap among communities and has diverse applications. In recent years, researchers have introduced direct S2ST models, which have the potential to translate speech without relying on intermediate text generation, have better decoding latency, and the ability to preserve paralinguistic and non-linguistic features. However, direct S2ST has yet to achieve quality performance for seamless communication and still lags behind the cascade models in terms of performance, especially in real-world translation. To the best of our knowledge, no comprehensive survey is available on the direct S2ST system, which beginners and advanced researchers can look upon for a quick survey. The present work provides a comprehensive review of direct S2ST models, data and application issues, and performance metrics. We critically analyze the models' performance over the benchmark datasets and provide research challenges and future directions.

machine learning, natural language, translation, (18 more...)

arXiv.org Artificial Intelligence

2411.14453

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
North America > Canada > Ontario > Toronto (0.04)
(34 more...)

Genre: Overview (1.00)

Industry:

Education (0.67)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Theoretical Analysis of Byte-Pair Encoding

Kozma, László, Voderholzer, Johannes

arXiv.org Artificial IntelligenceNov-13-2024

Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in grammar-based text compression. It is employed in a variety of language processing tasks such as machine translation or large language model (LLM) pretraining, to create a token dictionary of a prescribed size. Most evaluations of BPE to date are empirical, and the reasons for its good practical performance are not well understood. In this paper we focus on the optimization problem underlying BPE: finding a pair encoding that achieves optimal compression utility. We show that this problem is APX-complete, indicating that it is unlikely to admit a polynomial-time approximation scheme. This answers, in a stronger form, a question recently raised by Zouhar et al. On the positive side, we show that BPE approximates the compression utility of the optimal pair encoding to a worst-case factor between $0.333$ and $0.625$. Our results aim to explain the ongoing success of BPE and are, to our knowledge, the first rigorous guarantees on its compression utility that hold for all inputs.

algorithm, bpe, merge sequence, (13 more...)

arXiv.org Artificial Intelligence

2411.08671

Country:

North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Germany > Berlin (0.04)
(4 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.66)

Add feedback

Refining Translations with LLMs: A Constraint-Aware Iterative Prompting Approach

Chen, Shangfeng, Shi, Xiayang, Li, Pu, Li, Yinlin, Liu, Jingjing

arXiv.org Artificial IntelligenceNov-13-2024

Large language models (LLMs) have demonstrated remarkable proficiency in machine translation (MT), even without specific training on the languages in question. However, translating rare words in low-resource or domain-specific contexts remains challenging for LLMs. To address this issue, we propose a multi-step prompt chain that enhances translation faithfulness by prioritizing key terms crucial for semantic accuracy. Our method first identifies these keywords and retrieves their translations from a bilingual dictionary, integrating them into the LLM's context using Retrieval-Augmented Generation (RAG). We further mitigate potential output hallucinations caused by long prompts through an iterative self-checking mechanism, where the LLM refines its translations based on lexical and semantic constraints. Experiments using Llama and Qwen as base models on the FLORES-200 and WMT datasets demonstrate significant improvements over baselines, highlighting the effectiveness of our approach in enhancing translation faithfulness and robustness, particularly in low-resource scenarios.

llm, machine translation, translation, (13 more...)

arXiv.org Artificial Intelligence

2411.08348

Country:

North America > Canada > Ontario > Toronto (0.04)
North America > United States > Texas (0.04)
North America > United States > New York > New York County > New York City (0.04)
(5 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Deceiving Question-Answering Models: A Hybrid Word-Level Adversarial Approach

Li, Jiyao, Ni, Mingze, Gong, Yongshun, Liu, Wei

arXiv.org Artificial IntelligenceNov-12-2024

Deep learning underpins most of the currently advanced natural language processing (NLP) tasks such as textual classification, neural machine translation (NMT), abstractive summarization and question-answering (QA). However, the robustness of the models, particularly QA models, against adversarial attacks is a critical concern that remains insufficiently explored. This paper introduces QA-Attack (Question Answering Attack), a novel word-level adversarial strategy that fools QA models. Our attention-based attack exploits the customized attention mechanism and deletion ranking strategy to identify and target specific words within contextual passages. It creates deceptive inputs by carefully choosing and substituting synonyms, preserving grammatical integrity while misleading the model to produce incorrect responses. Our approach demonstrates versatility across various question types, particularly when dealing with extensive long textual inputs. Extensive experiments on multiple benchmark datasets demonstrate that QA-Attack successfully deceives baseline QA models and surpasses existing adversarial techniques regarding success rate, semantics changes, BLEU score, fluency and grammar error rate.

computational linguistic, machine learning, question answering, (22 more...)

arXiv.org Artificial Intelligence

2411.08248

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(16 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Media (1.00)
Information Technology > Security & Privacy (1.00)
Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

On the Role of Speech Data in Reducing Toxicity Detection Bias

Bell, Samuel J., Meglioli, Mariano Coria, Richards, Megan, Sánchez, Eduardo, Ropers, Christophe, Wang, Skyler, Williams, Adina, Sagun, Levent, Costa-jussà, Marta R.

arXiv.org Artificial IntelligenceNov-12-2024

Text toxicity detection systems exhibit significant biases, producing disproportionate rates of false positives on samples mentioning demographic groups. But what about toxicity detection in speech? To investigate the extent to which text-based biases are mitigated by speech-based systems, we produce a set of high-quality group annotations for the multilingual MuTox dataset, and then leverage these annotations to systematically compare speech- and text-based toxicity classifiers. Our findings indicate that access to speech data during inference supports reduced bias against group mentions, particularly for ambiguous and disagreement-inducing samples. Our results also suggest that improving classifiers, rather than transcription pipelines, is more helpful for reducing group bias. We publicly release our annotations and provide recommendations for future toxicity dataset construction.

annotator, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2411.08135

Country:

Asia > Middle East > Israel (0.04)
North America > United States > New York (0.04)
North America > Canada > Quebec > Montreal (0.04)
(4 more...)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

Add feedback

'It gets more and more confused': can AI replace translators?

The GuardianNov-11-2024, 11:10:00 GMT

As anyone who has tried pointing their phone's camera at a menu in a foreign country lately will know, machine translation has improved rapidly since the first days of Google Translate. The utility of AI-powered translation in situations like this is unquestionable – but the proposed use of AI in literary translation has been significantly more controversial. Dutch publisher Veen Bosch & Keuning's announcement that it would use AI translation for commercial fiction has outraged both authors and translators – despite attempts to reassure them with promises that no books will be translated in this way without careful checking and that authors will have to give consent. "A translator translates more than just words, we build bridges between cultures, taking into account the target readership every step of the way," says Michele Hutchison, winner of 2020's International Booker prize for her translation of Lucas Rijneveld's The Discomfort of Evening. "We smuggle in subtle clues to help the reader understand particular cultural elements or traditions. We convey rhythm, poetry, wordplay, metaphor. We research the precise terminology for say agricultural machinery, even in a novel."

artificial intelligence, natural language, translation, (12 more...)

The Guardian

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.57)

Add feedback

Towards Characterizing Cyber Networks with Large Language Models

Hartsock, Alaric, Pereira, Luiz Manella, Fink, Glenn

arXiv.org Artificial IntelligenceNov-11-2024

Threat hunting analyzes large, noisy, high-dimensional data to find sparse adversarial behavior. We believe adversarial activities, however they are disguised, are extremely difficult to completely obscure in high dimensional space. In this paper, we employ these latent features of cyber data to find anomalies via a prototype tool called Cyber Log Embeddings Model (CLEM). CLEM was trained on Zeek network traffic logs from both a real-world production network and an from Internet of Things (IoT) cybersecurity testbed. The model is deliberately overtrained on a sliding window of data to characterize each window closely. We use the Adjusted Rand Index (ARI) to comparing the k-means clustering of CLEM output to expert labeling of the embeddings. Our approach demonstrates that there is promise in using natural language modeling to understand cyber data.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2411.07089

Country:

North America > United States > South Carolina > Aiken County > Aiken (0.04)
North America > United States > Washington > Benton County > Richland (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
Europe > Portugal > Porto > Porto (0.04)

Genre: Research Report (0.64)

Industry:

Information Technology > Security & Privacy (1.00)
Energy (1.00)
Government > Military > Cyberwarfare (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Add feedback