AITopics

2304.01352

Country:

Asia > Russia (0.14)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > Armenia > Yerevan > Yerevan (0.04)

Genre: Research Report (0.64)

Industry: Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Costa-jussà, Marta R., Smith, Eric, Ropers, Christophe, Licht, Daniel, Maillard, Jean, Ferrando, Javier, Escolano, Carlos

Toxicity in Multilingual Machine Translation at Scale

arXiv.org Artificial IntelligenceApr-5-2023

Machine Translation systems can produce different types of errors, some of which are characterized as critical or catastrophic due to the specific negative impact that they can have on users. In this paper we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demographic axes) from English into 164 languages. An automatic toxicity evaluation shows that added toxicity across languages varies from 0% to 5%. The output languages with the most added toxicity tend to be low-resource ones, and the demographic axes with the most added toxicity include sexual orientation, gender and sex, and ability. We also perform human evaluation on a subset of 8 translation directions, confirming the prevalence of true added toxicity. We use a measurement of the amount of source contribution to the translation, where a low source contribution implies hallucination, to interpret what causes toxicity. Making use of the input attributions allows us to explain toxicity, because the source contributions significantly correlate with toxicity for 84% of languages studied. Given our findings, our recommendations to reduce added toxicity are to curate training data to avoid mistranslations, mitigate hallucination and check unstable translations.

machine learning, natural language, toxicity, (16 more...)

2210.0307

Country:

Europe > Italy > Tuscany > Florence (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
North America > United States > Pennsylvania (0.04)
(5 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.50)

Hamed, Injy, Habash, Nizar, Abdennadher, Slim, Vu, Ngoc Thang

Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation

Data sparsity is a main problem hindering the development of code-switching (CS) NLP systems. In this paper, we investigate data augmentation techniques for synthesizing dialectal Arabic-English CS text. We perform lexical replacements using word-aligned parallel corpora where CS points are either randomly chosen or learnt using a sequence-to-sequence model. We compare these approaches against dictionary-based replacements. We assess the quality of the generated sentences through human evaluation and evaluate the effectiveness of data augmentation on machine translation (MT), automatic speech recognition (ASR), and speech translation (ST) tasks. Results show that using a predictive model results in more natural CS sentences compared to the random approach, as reported in human judgements. In the downstream tasks, despite the random approach generating more data, both approaches perform equally (outperforming dictionary-based replacements). Overall, data augmentation achieves 34% improvement in perplexity, 5.2% relative improvement on WER for ASR task, +4.0-5.1 BLEU points on MT task, and +2.1-2.2 BLEU points on ST over a baseline trained on available data without augmentation.

artificial intelligence, machine learning, natural language, (19 more...)

2205.12649

Country: Europe (0.28)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.34)

Industry: Energy > Oil & Gas > Downstream (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Fatima, Mehwish, Kolber, Tim, Markert, Katja, Strube, Michael

SimCSum: Joint Learning of Simplification and Cross-lingual Summarization for Cross-lingual Science Journalism

Cross-lingual science journalism generates popular science stories of scientific articles different from the source language for a non-expert audience. Hence, a cross-lingual popular summary must contain the salient content of the input document, and the content should be coherent, comprehensible, and in a local language for the targeted audience. We improve these aspects of cross-lingual summary generation by joint training of two high-level NLP tasks, simplification and cross-lingual summarization. The former task reduces linguistic complexity, and the latter focuses on cross-lingual abstractive summarization. We propose a novel multi-task architecture - SimCSum consisting of one shared encoder and two parallel decoders jointly learning simplification and cross-lingual summarization. We empirically investigate the performance of SimCSum by comparing it with several strong baselines over several evaluation metrics and by human evaluation. Overall, SimCSum demonstrates statistically significant improvements over the state-of-the-art on two non-synthetic cross-lingual scientific datasets. Furthermore, we conduct an in-depth investigation into the linguistic properties of generated summaries and an error analysis.

artificial intelligence, machine learning, natural language, (18 more...)

2304.01621

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Bavaria > Regensburg (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
(14 more...)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Media > News (0.71)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Lupo, Lorenzo, Dinarelli, Marco, Besacier, Laurent

Encoding Sentence Position in Context-Aware Neural Machine Translation with Concatenation

Context-aware translation can be achieved by processing a concatenation of consecutive sentences with the standard Transformer architecture. This paper investigates the intuitive idea of providing the model with explicit information about the position of the sentences contained in the concatenation window. We compare various methods to encode sentence positions into token representations, including novel methods. Our results show that the Transformer benefits from certain sentence position encoding methods on English to Russian translation if trained with a context-discounted loss (Lupo et al., 2022). However, the same benefits are not observed in English to German. Further empirical efforts are necessary to define the conditions under which the proposed approach is beneficial.

artificial intelligence, computational linguistic, natural language, (15 more...)

2302.06459

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.05)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(12 more...)

Genre: Research Report > New Finding (0.86)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Montanelli, Stefano, Periti, Francesco

A Survey on Contextualised Semantic Shift Detection

Semantic Shift Detection (SSD) is the task of identifying, interpreting, and assessing the possible change over time in the meanings of a target word. Traditionally, SSD has been addressed by linguists and social scientists through manual and time-consuming activities. In the recent years, computational approaches based on Natural Language Processing and word embeddings gained increasing attention to automate SSD as much as possible. In particular, over the past three years, significant advancements have been made almost exclusively based on word contextualised embedding models, which can handle the multiple usages/meanings of the words and better capture the related semantic shifts. In this paper, we survey the approaches based on contextualised embeddings for SSD (i.e., CSSDetection) and we propose a classification framework characterised by meaning representation, time-awareness, and learning modality dimensions. The framework is exploited i) to review the measures for shift assessment, ii) to compare the approaches on performance, and iii) to discuss the current issues in terms of scalability, interpretability, and robustness. Open challenges and future research directions about CSSDetection are finally outlined.

computational linguistic, machine learning, natural language, (18 more...)

2304.01666

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
Europe > Italy > Tuscany > Florence (0.04)
(21 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Li, Bryan, Rasooli, Mohammad Sadegh, Patel, Ajay, Callison-Burch, Chris

Multilingual Bidirectional Unsupervised Translation Through Multilingual Finetuning and Back-Translation

arXiv.org Artificial IntelligenceApr-3-2023

We propose a two-stage approach for training a single NMT model to translate unseen languages both to and from English. For the first stage, we initialize an encoder-decoder model to pretrained XLM-R and RoBERTa weights, then perform multilingual fine-tuning on parallel data in 40 languages to English. We find this model can generalize to zero-shot translations on unseen languages. For the second stage, we leverage this generalization ability to generate synthetic parallel data from monolingual datasets, then bidirectionally train with successive rounds of back-translation. Our approach, which we EcXTra (English-centric Crosslingual (X) Transfer), is conceptually simple, only using a standard cross-entropy objective throughout. It is also data-driven, sequentially leveraging auxiliary parallel data and monolingual data. We evaluate unsupervised NMT results for 7 low-resource languages, and find that each round of back-translation training further refines bidirectional performance. Our final single EcXTra-trained model achieves competitive translation performance in all translation directions, notably establishing a new state-of-the-art for English-to-Kazakh (22.9 > 10.4 BLEU). Our code is available at https://github.com/manestay/EcXTra .

machine learning, natural language, translation, (18 more...)

2209.02821

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
Europe > Italy > Tuscany > Florence (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)
(9 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Yadav, Ankit, Chandel, Shubham, Chatufale, Sushant, Bandhakavi, Anil

LAHM : Large Annotated Dataset for Multi-Domain and Multilingual Hate Speech Identification

arXiv.org Artificial IntelligenceApr-3-2023

Current research on hate speech analysis is typically oriented towards monolingual and single classification tasks. In this paper, we present a new multilingual hate speech analysis dataset for English, Hindi, Arabic, French, German and Spanish languages for multiple domains across hate speech - Abuse, Racism, Sexism, Religious Hate and Extremism. To the best of our knowledge, this paper is the first to address the problem of identifying various types of hate speech in these five wide domains in these six languages. In this work, we describe how we created the dataset, created annotations at high level and low level for different domains and how we use it to test the current state-of-the-art multilingual and multitask learning approaches. We evaluate our dataset in various monolingual, cross-lingual and machine translation classification settings and compare it against open source English datasets that we aggregated and merged for this task. Then we discuss how this approach can be used to create large scale hate-speech datasets and how to leverage our annotations in order to improve hate speech detection and classification in general.

artificial intelligence, machine learning, natural language, (18 more...)

2304.00913

Country:

Europe > United Kingdom (0.14)
North America > United States > California > San Diego County > San Diego (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Belgium > Flanders > East Flanders > Ghent (0.04)

Genre: Research Report (0.64)

Industry:

Information Technology > Services (0.68)
Law > Civil Rights & Constitutional Law (0.51)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.67)

arXiv.org Artificial IntelligenceApr-3-2023

Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval

Lin, Jimmy, Alfonso-Hermelo, David, Jeronymo, Vitor, Kamalloo, Ehsan, Lassance, Carlos, Nogueira, Rodrigo, Ogundepo, Odunayo, Rezagholizadeh, Mehdi, Thakur, Nandan, Yang, Jheng-Hong, Zhang, Xinyu

The advent of multilingual language models has generated a resurgence of interest in cross-lingual information retrieval (CLIR), which is the task of searching documents in one language with queries from another. However, the rapid pace of progress has led to a confusing panoply of methods and reproducibility has lagged behind the state of the art. In this context, our work makes two important contributions: First, we provide a conceptual framework for organizing different approaches to cross-lingual retrieval using multi-stage architectures for mono-lingual retrieval as a scaffold. Second, we implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese. Our efforts are built on a collaboration of the two teams that submitted the most effective runs to the TREC evaluation. These contributions provide a firm foundation for future advances.

information retrieval, natural language, translation, (14 more...)

2304.01019

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > Canada > Ontario > Waterloo Region > Waterloo (0.05)
(13 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

arXiv.org Artificial IntelligenceApr-2-2023

Semi-supervised Neural Machine Translation with Consistency Regularization for Low-Resource Languages

Pham, Viet H., Pham, Thang M., Nguyen, Giang, Nguyen, Long, Dinh, Dien

The advent of deep learning has led to a significant gain in machine translation. However, most of the studies required a large parallel dataset which is scarce and expensive to construct and even unavailable for some languages. This paper presents a simple yet effective method to tackle this problem for low-resource languages by augmenting high-quality sentence pairs and training NMT models in a semi-supervised manner. Specifically, our approach combines the cross-entropy loss for supervised learning with KL Divergence for unsupervised fashion given pseudo and augmented target sentences derived from the model. We also introduce a SentenceBERT-based filter to enhance the quality of augmenting data by retaining semantically similar sentence pairs. Experimental results show that our approach significantly improves NMT baselines, especially on low-resource datasets with 0.46--2.03 BLEU scores. We also demonstrate that using unsupervised training for augmented data is more efficient than reusing the ground-truth target sentences for supervised learning.

machine learning, natural language, translation, (14 more...)

2304.00557

Country:

Europe > Germany > Berlin (0.05)
Europe > France (0.05)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(5 more...)

Genre: Research Report > New Finding (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)