AITopics

Although recent years have brought significant progress in improving translation of unambiguously gendered sentences, translation of ambiguously gendered input remains relatively unexplored. When source gender is ambiguous, machine translation models typically default to stereotypical gender roles, perpetuating harmful bias. Recent work has led to the development of "gender rewriters" that generate alternative gender translations on such ambiguous inputs, but such systems are plagued by poor linguistic coverage. To encourage better performance on this task we present and release GATE, a linguistically diverse corpus of gender-ambiguous source sentences along with multiple alternative target language translations. We also provide tools for evaluation and system analysis when using GATE and use them to evaluate our translation rewriter system.

machine learning, natural language, translation, (18 more...)

2303.03975

Country:

Europe > Italy > Tuscany > Florence (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(7 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Zhang, Ziqiang, Zhou, Long, Wang, Chengyi, Chen, Sanyuan, Wu, Yu, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, Wang, Huaming, Li, Jinyu, He, Lei, Zhao, Sheng, Wei, Furu

We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID. Audio samples are available at \url{https://aka.ms/vallex}.

artificial intelligence, machine learning, natural language, (18 more...)

2303.03926

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Zhang, Ruochen, Eickhoff, Carsten

CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization

Cross-lingual summarization (CLS) has attracted increasing interest in recent years due to the availability of large-scale web-mined datasets and the advancements of multilingual language models. However, given the rareness of naturally occurring CLS resources, the majority of datasets are forced to rely on translation which can contain overly literal artifacts. This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching. This alteration between languages in mid-message is a common phenomenon in multilingual settings yet has been largely overlooked in cross-lingual contexts due to data scarcity. To address this gap, we introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news. It consists of over 24,000 English source articles and 18,000 human-curated Chinese news summaries, with more than 92% of the summaries containing code-switched phrases. For reference, we evaluate the performance of existing approaches including pipeline, end-to-end, and zero-shot methods. We show that leveraging existing resources as a pretraining step does not improve performance on CroCoSum, indicating the limited generalizability of existing resources. Finally, we discuss the challenges of evaluating cross-lingual summarizers on code-switched generation through qualitative error analyses. Our collection and code can be accessed at https://github.com/RosenZhang/CroCoSum.

artificial intelligence, machine learning, natural language, (16 more...)

2303.04092

Country:

North America > United States (0.14)
Asia > China > Hong Kong (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(10 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.94)

Ferlin, Maria, Majchrowska, Sylwia, Plantykow, Marta, Kwaśniwska, Alicja, Mikołajczyk-Bareła, Agnieszka, Olech, Milena, Nalepa, Jakub

On the Importance of Sign Labeling: The Hamburg Sign Language Notation System Case Study

Labeling is the cornerstone of supervised machine learning, which has been exploited in a plethora of various applications, with sign language recognition being one of them. However, such algorithms must be fed with a huge amount of consistently labeled data during the training process to elaborate a well-generalizing model. In addition, there is a great need for an automated solution that works with any nationally diversified sign language. Although there are language-agnostic transcription systems, such as the Hamburg Sign Language Notation System (HamNoSys) that describe the signer's initial position and body movement instead of the glosses' meanings, there are still issues with providing accurate and reliable labels for every real-world use case. In this context, the industry relies heavily on manual attribution and labeling of the available video data. In this work, we tackle this issue and thoroughly analyze the HamNoSys labels provided by various maintainers of open sign language corpora in five sign languages, in order to examine the challenges encountered in labeling video data. We also investigate the consistency and objectivity of HamNoSys-based labels for the purpose of training machine learning models. Our findings provide valuable insights into the limitations of the current labeling methods and pave the way for future research on developing more accurate and efficient solutions for sign language recognition.

artificial intelligence, machine learning, natural language, (17 more...)

2302.10768

Country:

Europe > Poland > Pomerania Province > Gdańsk (0.05)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Oceania > Australia (0.04)
(16 more...)

Genre:

Overview (0.93)
Research Report > New Finding (0.48)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Laurençon, Hugo, Saulnier, Lucile, Wang, Thomas, Akiki, Christopher, del Moral, Albert Villanova, Scao, Teven Le, Von Werra, Leandro, Mou, Chenghao, Ponferrada, Eduardo González, Nguyen, Huu, Frohberg, Jörg, Šaško, Mario, Lhoest, Quentin, McMillan-Major, Angelina, Dupont, Gerard, Biderman, Stella, Rogers, Anna, allal, Loubna Ben, De Toni, Francesco, Pistilli, Giada, Nguyen, Olivier, Nikpoor, Somaieh, Masoud, Maraim, Colombo, Pierre, de la Rosa, Javier, Villegas, Paulo, Thrush, Tristan, Longpre, Shayne, Nagel, Sebastian, Weber, Leon, Muñoz, Manuel, Zhu, Jian, Van Strien, Daniel, Alyafeai, Zaid, Almubarak, Khalid, Vu, Minh Chien, Gonzalez-Dios, Itziar, Soroa, Aitor, Lo, Kyle, Dey, Manan, Suarez, Pedro Ortiz, Gokaslan, Aaron, Bose, Shamik, Adelani, David, Phan, Long, Tran, Hieu, Yu, Ian, Pai, Suhas, Chim, Jenny, Lepercq, Violette, Ilic, Suzana, Mitchell, Margaret, Luccioni, Sasha Alexandra, Jernite, Yacine

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM)(BigScience Workshop, 2022) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

large language model, machine learning, natural language, (18 more...)

2303.03915

Country:

Africa > Niger (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
(30 more...)

Genre: Research Report (0.50)

Industry:

Media (0.92)
Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

SlateMar-6-2023, 10:50:00 GMT

The Tech That Helped Me Through My Dad's Death--and in Some Ways, Keeps Him Alive

When my dad died, he became part of the cloud. Not the one up high in the sky, but rather an online cumulus that now stores and archives a record of his last 18 months on earth. On my laptop, and even more prominently on my phone, I carry with me digital traces of my dad that I can't yet bring myself to access. Four years after his death, I still sit with a kind of grief that remains more raw than residual, and his memory lingers in digital purgatory--undeleted yet untouched; saved but not sought. He "lives" in this liminal digital space; like a grave I can't yet bring myself to visit, but simply know is there.

google translate, kakaotalk, translation, (6 more...)

Slate

Country:

North America > United States > Missouri (0.05)
North America > United States > California (0.05)
North America > United States > Arizona (0.05)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.45)

#artificialintelligenceMar-6-2023, 09:59:38 GMT

AI translation firm unveils 'world-first' timeline to singularity

An Italian company has unveiled a novel method of measuring AI progress: analyzing improvements in machine translation. Translated, a provider of translation services, used the approach to predict when we will achieve singularity, a vague concept often defined as the point where machines become smarter than humans. The Rome-based business sets this milestone at the moment when AI provides "a perfect translation." According to the new research, this arrives when machine translation (MT) is better than top human translations. Translated's analysis suggests this will happen before the end of the 2020s.

machine translation, singularity, translation, (12 more...)

#artificialintelligence

Genre: Research Report (0.90)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial IntelligenceMar-5-2023

WADER at SemEval-2023 Task 9: A Weak-labelling framework for Data augmentation in tExt Regression Tasks

Suri, Manan, Garg, Aaryak, Chaudhary, Divya, Gorton, Ian, Kumar, Bijendra

Intimacy is an essential element of human relationships and language is a crucial means of conveying it. Textual intimacy analysis can reveal social norms in different contexts and serve as a benchmark for testing computational models' ability to understand social information. In this paper, we propose a novel weak-labeling strategy for data augmentation in text regression tasks called WADER. WADER uses data augmentation to address the problems of data imbalance and data scarcity and provides a method for data augmentation in cross-lingual, zero-shot tasks. We benchmark the performance of State-of-the-Art pre-trained multilingual language models using WADER and analyze the use of sampling techniques to mitigate bias in data and optimally select augmentation candidates. Our results show that WADER outperforms the baseline model and provides a direction for mitigating data imbalance and scarcity in text regression tasks.

large language model, machine learning, natural language, (17 more...)

2303.02758

Country:

North America > United States > Michigan (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > India > NCT > New Delhi (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Guerreiro, Nuno M., Voita, Elena, Martins, André F. T.

Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation

arXiv.org Artificial IntelligenceMar-5-2023

Although the problem of hallucinations in neural machine translation (NMT) has received some attention, research on this highly pathological phenomenon lacks solid ground. Previous work has been limited in several ways: it often resorts to artificial settings where the problem is amplified, it disregards some (common) types of hallucinations, and it does not validate adequacy of detection heuristics. In this paper, we set foundations for the study of NMT hallucinations. First, we work in a natural setting, i.e., in-domain data without artificial noise neither in training nor in inference. Next, we annotate a dataset of over 3.4k sentences indicating different kinds of critical errors and hallucinations. Then, we turn to detection methods and both revisit methods used previously and propose using glass-box uncertainty-based detectors. Overall, we show that for preventive settings, (i) previously used methods are largely inadequate, (ii) sequence log-probability works best and performs on par with reference-based methods. Finally, we propose DeHallucinator, a simple method for alleviating hallucinations at test time that significantly reduces the hallucinatory rate. To ease future research, we release our annotated dataset for WMT18 German-English data, along with the model, training data, and code.

artificial intelligence, machine translation, natural language, (14 more...)

2208.05309

Country:

Europe > Portugal > Lisbon > Lisbon (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Sweden > Stockholm > Stockholm (0.05)
(14 more...)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Awal, Md Rabiul, Lee, Roy Ka-Wei, Tanwar, Eshaan, Garg, Tanmay, Chakraborty, Tanmoy

Model-Agnostic Meta-Learning for Multilingual Hate Speech Detection

arXiv.org Artificial IntelligenceMar-4-2023

Hate speech in social media is a growing phenomenon, and detecting such toxic content has recently gained significant traction in the research community. Existing studies have explored fine-tuning language models (LMs) to perform hate speech detection, and these solutions have yielded significant performance. However, most of these studies are limited to detecting hate speech only in English, neglecting the bulk of hateful content that is generated in other languages, particularly in low-resource languages. Developing a classifier that captures hate speech and nuances in a low-resource language with limited data is extremely challenging. To fill the research gap, we propose HateMAML, a model-agnostic meta-learning-based framework that effectively performs hate speech detection in low-resource languages. HateMAML utilizes a self-supervision strategy to overcome the limitation of data scarcity and produces better LM initialization for fast adaptation to an unseen target language (i.e., cross-lingual transfer) or other hate speech datasets (i.e., domain generalization). Extensive experiments are conducted on five datasets across eight different low-resource languages. The results show that HateMAML outperforms the state-of-the-art baselines by more than 3% in the cross-domain multilingual transfer setting. We also conduct ablation studies to analyze the characteristics of HateMAML.

hatemaml, speech detection, target language, (15 more...)

2303.02513

Country:

Asia > India > NCT > Delhi (0.05)
Asia > Singapore (0.04)
North America > Canada > Saskatchewan (0.04)
(5 more...)

Genre: Research Report > New Finding (0.48)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)