AITopics

2406.00021

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Asia > India > NCT > New Delhi (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Gete, Harritxu, Etchegoyhen, Thierry

Does Context Help Mitigate Gender Bias in Neural Machine Translation?

arXiv.org Artificial Intelligence Jun-18-2024

Neural machine translation (NMT) models tend to exhibit gender bias, originating from their training data (Stanovsky et al., 2019; Saunders and Byrne, 2020). A typical example is the translation of gender-neutral professions in a language like English, into languages with differentiated feminine and masculine forms. In this case, NMT systems often produce translations that reflect gender-stereotypical ...

First, we evaluated the performance of context-aware models in the translation of stereotypical professions from English into German and French, measuring translation accuracy on gender-based subsets of the data. Our results in this case indicate that, although context-aware models lead to significantly increasing the use of feminine forms, this was achieved mainly for professions that are stereotypically viewed as feminine, thus with limited bias mitigation.

gender bias, machine translation, translation, (12 more...)

2406.12364

Country:

Europe > Italy > Tuscany > Florence (0.05)
South America > Argentina (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)

Genre: Research Report > New Finding (0.90)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Iana, Andreea, Schmidt, Fabian David, Glavaš, Goran, Paulheim, Heiko

News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation

arXiv.org Artificial Intelligence Jun-18-2024

Rapidly growing numbers of multilingual news consumers pose an increasing challenge to news recommender systems in terms of providing customized recommendations. First, existing neural news recommenders, even when powered by multilingual language models (LMs), suffer substantial performance losses in zero-shot cross-lingual transfer (ZS-XLT). Second, the current paradigm of fine-tuning the backbone LM of a neural recommender on task-specific data is computationally expensive and infeasible in few-shot recommendation and cold-start setups, where data is scarce or completely unavailable. In this work, we propose a news-adapted sentence encoder (NaSE), domain-specialized from a pretrained massively multilingual sentence encoder (SE). To this end, we construct and leverage PolyNews and PolyNewsParallel, two multilingual news-specific corpora. With the news-adapted multilingual SE in place, we test the effectiveness of (i.e., question the need for) supervised fine-tuning for news recommendation, and propose a simple and strong baseline based on (i) frozen NaSE embeddings and (ii) late click-behavior fusion. We show that NaSE achieves state-of-the-art performance in ZS-XLT in true cold-start and few-shot news recommendation.

computational linguistic, proceedings, recommendation, (15 more...)

2406.12634

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Africa > Niger (0.05)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Los Angeles Times Jun-17-2024, 20:04:12 GMT

California plans to enlist AI to translate healthcare information

In Spanish, there are at least a dozen ways to say someone has the flu -- depending on the country. Translating "cardiac arrest" into Spanish is also tricky because "arresto" means getting detained by the police. Likewise, "intoxicado" means you have food poisoning, not that you're drunk. The examples of how translation could go awry in any language are endless: Words take on new meanings, idioms come and go, and communities adopt slang and dialects for everyday life. Human translators work hard to keep up with the changes, but California plans to soon entrust that responsibility to technology. State health policy officials want to harness emerging artificial intelligence technology to translate a broad swath of documents and websites related to "health and social services information, programs, benefits and services," according to state records.

california plan, information, translation, (12 more...)

Country: North America > United States > California > San Francisco County > San Francisco (0.05)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.97)

Kocmi, Tom, Zouhar, Vilém, Avramidis, Eleftherios, Grundkiewicz, Roman, Karpinska, Marzena, Popović, Maja, Sachan, Mrinmaya, Shmatova, Mariya

Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation

High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited, especially for low-resource languages. On the other hand, just assigning overall scores, as in Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.

annotator, machine translation, protocol, (10 more...)

2406.11580

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Liu, Shuai, May, Jonathan

Style Transfer with Multi-iteration Preference Optimization

Numerous recent techniques for text style transfer characterize their approaches as variants of reinforcement learning and preference optimization. In this work, we consider the relationship between these approaches and a class of optimization approaches developed primarily for (non-neural) statistical machine translation, formerly known as 'tuning'. Inspired by these techniques from the past, we improve upon established preference optimization approaches, incorporating multiple iterations of exploration and optimization, and choosing contrastive examples by following a 'hope' vs 'fear' sampling strategy. Cognizant of the difference between machine translation and style transfer, however, we further tailor our framework with a new pseudo-parallel generation method and a dynamic weighted reward aggregation method to tackle the lack of parallel data and the need for a multi-objective reward. We evaluate our model on two commonly used text style transfer datasets. Through automatic and human evaluation results we show the effectiveness and the superiority of our model compared to state-of-the-art baselines.

computational linguistic, dataset, style transfer, (15 more...)

2406.11581

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California (0.14)
Asia > Singapore (0.04)

Genre:

Research Report > Experimental Study (0.69)
Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Lyu, Boxuan, Kamigaito, Hidetaka, Funakoshi, Kotaro, Okumura, Manabu

Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation

Maximum a posteriori decoding, a commonly used method for neural machine translation (NMT), aims to maximize the estimated posterior probability. However, high estimated probability does not always lead to high translation quality. Minimum Bayes Risk (MBR) decoding offers an alternative by seeking hypotheses with the highest expected utility. In this work, we show that Quality Estimation (QE) reranking, which uses a QE model as a reranker, can be viewed as a variant of MBR. Inspired by this, we propose source-based MBR (sMBR) decoding, a novel approach that utilizes synthetic sources generated by backward translation as "support hypotheses" and a reference-free quality estimation metric as the utility function, marking the first work to solely use sources in MBR decoding. Experiments show that sMBR significantly outperforms QE reranking and is competitive with standard MBR decoding. Furthermore, sMBR calls the utility function fewer times compared to MBR. Our findings suggest that sMBR is a promising approach for high-quality NMT decoding.
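
The relationship between MBR decoding and QE reranking described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy word-overlap utility, and the toy length-based QE score are all assumptions made for illustration, standing in for real learned metrics. The sketch shows generic MBR selection (pick the hypothesis with the highest average utility over a support set) and that, when the support set is the single source sentence and the utility is a reference-free QE score, MBR selection coincides with QE reranking.

```python
from typing import Callable, Sequence


def mbr_decode(hypotheses: Sequence[str], support: Sequence[str],
               utility: Callable[[str, str], float]) -> str:
    """Pick the hypothesis with the highest average utility over the support set."""
    def expected_utility(hyp: str) -> float:
        return sum(utility(hyp, s) for s in support) / len(support)
    return max(hypotheses, key=expected_utility)


def qe_rerank(source: str, hypotheses: Sequence[str],
              qe_score: Callable[[str, str], float]) -> str:
    """Pick the hypothesis that a reference-free QE score rates highest for the source."""
    return max(hypotheses, key=lambda hyp: qe_score(source, hyp))


def token_overlap(a: str, b: str) -> float:
    """Toy utility (assumption): Jaccard overlap of the two word sets."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def toy_qe(source: str, hyp: str) -> float:
    """Toy QE score (assumption): penalize word-count mismatch with the source."""
    return -abs(len(source.split()) - len(hyp.split()))


# Hypothetical example data, not taken from the paper.
source = "die Katze sass"
candidates = ["the cat sat", "the cat sat down", "the cat"]

# Standard MBR: the candidate list doubles as the support set.
mbr_choice = mbr_decode(candidates, candidates, token_overlap)

# With the source as the only support item and QE as the utility,
# MBR selection reduces to QE reranking.
qe_choice = qe_rerank(source, candidates, toy_qe)
mbr_with_source = mbr_decode(candidates, [source], lambda h, s: toy_qe(s, h))

print(mbr_choice, qe_choice, mbr_with_source)
```

In the source-based setting the abstract describes, the support list would hold several synthetic sources produced by backward translation rather than the single original source used here.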

hypothesis, mbr, translation, (14 more...)

2406.11632

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Keita, Mamadou K., Ibrahim, Elysabhete Amadou, Alfari, Habibatou Abdoulaye, Homan, Christopher

Feriji: A French-Zarma Parallel Corpus, Glossary & Translator

Machine translation (MT) is a rapidly expanding field that has experienced significant advancements in recent years with the development of models capable of translating multiple languages with remarkable accuracy. However, the representation of African languages in this field still needs to improve due to linguistic complexities and limited resources. This applies to the Zarma language, a dialect of Songhay (of the Nilo-Saharan language family) spoken by over 5 million people across Niger and neighboring countries (Lewis et al., 2016). This paper introduces Feriji, the first robust French-Zarma parallel corpus and glossary designed for MT. The corpus, containing 61,085 sentences in Zarma and 42,789 in French, and a glossary of 4,062 words represent a significant step in addressing the need for more resources for Zarma. We fine-tune three large language models on our dataset, obtaining a BLEU score of 30.06 on the best-performing model. We further evaluate the models on human judgments of fluency, comprehension, and readability and the importance and impact of the corpus and models. Our contributions help to bridge a significant language gap and promote an essential and overlooked indigenous African language.

feriji, translation, zarma, (16 more...)

2406.05888

Country:

Africa > Niger (0.25)
North America > United States (0.14)
Africa > Sub-Saharan Africa (0.04)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report (0.82)

Industry:

Health & Medicine (0.47)
Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Bafna, Niyati, Koehn, Philipp, Yarowsky, David

Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That!

While Transformer-based neural machine translation (NMT) is very effective in high-resource settings, many languages lack the necessary large parallel corpora to benefit from it. In the context of low-resource (LR) MT between two closely-related languages, a natural intuition is to seek benefits from structural "shortcuts", such as copying subwords from the source to the target, given that such language pairs often share a considerable number of identical words, cognates, and borrowings. We test Pointer-Generator Networks for this purpose for six language pairs over a variety of resource ranges, and find weak improvements for most settings. However, analysis shows that the model does not show greater improvements for closely-related vs. more distant language pairs, or for lower resource ranges, and that the models do not exhibit the expected usage of the mechanism for shared subwords. Our discussion of the reasons for this behaviour highlights several general challenges for LR NMT, such as modern tokenization strategies, noisy real-world conditions, and linguistic complexities. We call for better scrutiny of linguistically motivated improvements to NMT given the blackbox nature of Transformer models, as well as for a focus on the above problems in the field.

machine translation, mechanism, translation, (14 more...)

2403.10963

Country:

North America > Dominican Republic (0.04)
Europe > Spain (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

AnyTrans: Translate AnyText in the Image with Large Scale Models

Qian, Zhipeng, Zhang, Pei, Yang, Baosong, Fan, Kai, Ma, Yiwei, Wong, Derek F., Sun, Xiaoshuai, Ji, Rongrong

This paper introduces AnyTrans, an all-encompassing framework for the Translate AnyText in the Image (TATI) task, which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, the advanced inpainting and editing abilities of diffusion models make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Additionally, our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the TATI task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.

arxiv, cornell university, translation, (15 more...)

2406.11432

Country:

North America > Mexico (0.04)
Asia > Macao (0.04)
Asia > Japan (0.04)
Asia > China > Fujian Province > Xiamen (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)