AITopics

2203.10854

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
Oceania > Australia > South Australia > Adelaide (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(2 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.96)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)

North, Kai, Zampieri, Marcos, Shardlow, Matthew

Lexical Complexity Prediction: An Overview

arXiv.org Artificial IntelligenceMar-8-2023

Understanding the meaning of words in context is fundamental for reading comprehension. The perceived difficulty, hereafter referred to as complexity, of a target word within a given text varies widely among readers. With an increased demand for distance learning and educational technologies[107], research into automatically predicting which words are likely to cause comprehension problems is becoming a popular area of research [115, 147, 185]. Systems have been created to identify complex words that are difficult to acquire, reproduce, or understand for children [79], second-language learners [89], people suffering from a reading disability, such as dyslexia [131] or aphasia [35, 53], or more generally, individuals with low literacy [59, 175]. In Computational Linguistics and Natural Language Processing (NLP), the task of automatically recognizing complex words is most often achieved by training machine learning (ML) models. These ML models assign a complexity value to each target word within an inputted extract, sentence, or text that allows for the identification of complex words. This information can then be used to improve downstream lexical and text simplification systems that provide simpler alternatives to aid reading comprehension. Take the extract shown in Table 1 for example.

artificial intelligence, machine learning, natural language, (16 more...)

doi: 10.1145/3557885

2303.04851

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
Asia > Thailand > Bangkok > Bangkok (0.05)
(44 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)
Instructional Material (0.87)

Industry:

Health & Medicine (1.00)
Education > Educational Technology > Educational Software > Computer Based Training (0.92)
Education > Educational Setting > Online (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(4 more...)

Out-of-Distribution Detection and Selective Generation for Conditional Language Models

Ren, Jie, Luo, Jiaming, Zhao, Yao, Krishna, Kundan, Saleh, Mohammad, Lakshminarayanan, Balaji, Liu, Peter J.

Machine learning algorithms typically assume independent and identically distributed samples in training and at test time. Much work has shown that high-performing ML classifiers can degrade significantly and provide overly-confident, wrong classification predictions, particularly for out-of-distribution (OOD) inputs. Conditional language models (CLMs) are predominantly trained to classify the next token in an output sequence, and may suffer even worse degradation on OOD inputs as the prediction is done auto-regressively over many steps. Furthermore, the space of potential low-quality outputs is larger as arbitrary text can be generated and it is important to know when to trust the generated output. We present a highly accurate and lightweight OOD detection method for CLMs, and demonstrate its effectiveness on abstractive summarization and translation. We also show how our method can be used under the common and realistic setting of distribution shift for selective generation (analogous to selective prediction for classification) of high-quality outputs, while automatically abstaining from low-quality ones, enabling safer deployment of generative language models.

artificial intelligence, machine learning, natural language, (18 more...)

2209.15558

Country:

North America > United States > Missouri > Jackson County > Kansas City (0.14)
North America > Canada > Alberta (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(16 more...)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.68)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine (0.67)
Leisure & Entertainment > Sports > Soccer (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Rarrick, Spencer, Naik, Ranjita, Mathur, Varun, Poudel, Sundar, Chowdhary, Vishal

GATE: A Challenge Set for Gender-Ambiguous Translation Examples

Although recent years have brought significant progress in improving translation of unambiguously gendered sentences, translation of ambiguously gendered input remains relatively unexplored. When source gender is ambiguous, machine translation models typically default to stereotypical gender roles, perpetuating harmful bias. Recent work has led to the development of "gender rewriters" that generate alternative gender translations on such ambiguous inputs, but such systems are plagued by poor linguistic coverage. To encourage better performance on this task we present and release GATE, a linguistically diverse corpus of gender-ambiguous source sentences along with multiple alternative target language translations. We also provide tools for evaluation and system analysis when using GATE and use them to evaluate our translation rewriter system.

machine learning, natural language, translation, (18 more...)

2303.03975

Country:

Europe > Italy > Tuscany > Florence (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(7 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Zhang, Ziqiang, Zhou, Long, Wang, Chengyi, Chen, Sanyuan, Wu, Yu, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, Wang, Huaming, Li, Jinyu, He, Lei, Zhao, Sheng, Wei, Furu

We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID. Audio samples are available at \url{https://aka.ms/vallex}.

artificial intelligence, machine learning, natural language, (18 more...)

2303.03926

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Zhang, Ruochen, Eickhoff, Carsten

CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization

Cross-lingual summarization (CLS) has attracted increasing interest in recent years due to the availability of large-scale web-mined datasets and the advancements of multilingual language models. However, given the rareness of naturally occurring CLS resources, the majority of datasets are forced to rely on translation which can contain overly literal artifacts. This restricts our ability to observe naturally occurring CLS pairs that capture organic diction, including instances of code-switching. This alteration between languages in mid-message is a common phenomenon in multilingual settings yet has been largely overlooked in cross-lingual contexts due to data scarcity. To address this gap, we introduce CroCoSum, a dataset of cross-lingual code-switched summarization of technology news. It consists of over 24,000 English source articles and 18,000 human-curated Chinese news summaries, with more than 92% of the summaries containing code-switched phrases. For reference, we evaluate the performance of existing approaches including pipeline, end-to-end, and zero-shot methods. We show that leveraging existing resources as a pretraining step does not improve performance on CroCoSum, indicating the limited generalizability of existing resources. Finally, we discuss the challenges of evaluating cross-lingual summarizers on code-switched generation through qualitative error analyses. Our collection and code can be accessed at https://github.com/RosenZhang/CroCoSum.

artificial intelligence, machine learning, natural language, (16 more...)

2303.04092

Country:

North America > United States (0.14)
Asia > China > Hong Kong (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(10 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.94)

Ferlin, Maria, Majchrowska, Sylwia, Plantykow, Marta, Kwaśniwska, Alicja, Mikołajczyk-Bareła, Agnieszka, Olech, Milena, Nalepa, Jakub

On the Importance of Sign Labeling: The Hamburg Sign Language Notation System Case Study

Labeling is the cornerstone of supervised machine learning, which has been exploited in a plethora of various applications, with sign language recognition being one of them. However, such algorithms must be fed with a huge amount of consistently labeled data during the training process to elaborate a well-generalizing model. In addition, there is a great need for an automated solution that works with any nationally diversified sign language. Although there are language-agnostic transcription systems, such as the Hamburg Sign Language Notation System (HamNoSys) that describe the signer's initial position and body movement instead of the glosses' meanings, there are still issues with providing accurate and reliable labels for every real-world use case. In this context, the industry relies heavily on manual attribution and labeling of the available video data. In this work, we tackle this issue and thoroughly analyze the HamNoSys labels provided by various maintainers of open sign language corpora in five sign languages, in order to examine the challenges encountered in labeling video data. We also investigate the consistency and objectivity of HamNoSys-based labels for the purpose of training machine learning models. Our findings provide valuable insights into the limitations of the current labeling methods and pave the way for future research on developing more accurate and efficient solutions for sign language recognition.

artificial intelligence, machine learning, natural language, (17 more...)

2302.10768

Country:

Europe > Poland > Pomerania Province > Gdańsk (0.05)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Oceania > Australia (0.04)
(16 more...)

Genre:

Overview (0.93)
Research Report > New Finding (0.48)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Laurençon, Hugo, Saulnier, Lucile, Wang, Thomas, Akiki, Christopher, del Moral, Albert Villanova, Scao, Teven Le, Von Werra, Leandro, Mou, Chenghao, Ponferrada, Eduardo González, Nguyen, Huu, Frohberg, Jörg, Šaško, Mario, Lhoest, Quentin, McMillan-Major, Angelina, Dupont, Gerard, Biderman, Stella, Rogers, Anna, allal, Loubna Ben, De Toni, Francesco, Pistilli, Giada, Nguyen, Olivier, Nikpoor, Somaieh, Masoud, Maraim, Colombo, Pierre, de la Rosa, Javier, Villegas, Paulo, Thrush, Tristan, Longpre, Shayne, Nagel, Sebastian, Weber, Leon, Muñoz, Manuel, Zhu, Jian, Van Strien, Daniel, Alyafeai, Zaid, Almubarak, Khalid, Vu, Minh Chien, Gonzalez-Dios, Itziar, Soroa, Aitor, Lo, Kyle, Dey, Manan, Suarez, Pedro Ortiz, Gokaslan, Aaron, Bose, Shamik, Adelani, David, Phan, Long, Tran, Hieu, Yu, Ian, Pai, Suhas, Chim, Jenny, Lepercq, Violette, Ilic, Suzana, Mitchell, Margaret, Luccioni, Sasha Alexandra, Jernite, Yacine

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM)(BigScience Workshop, 2022) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

large language model, machine learning, natural language, (18 more...)

2303.03915

Country:

Africa > Niger (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
(30 more...)

Genre: Research Report (0.50)

Industry:

Media (0.92)
Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

SlateMar-6-2023, 10:50:00 GMT

The Tech That Helped Me Through My Dad's Death--and in Some Ways, Keeps Him Alive

When my dad died, he became part of the cloud. Not the one up high in the sky, but rather an online cumulus that now stores and archives a record of his last 18 months on earth. On my laptop, and even more prominently on my phone, I carry with me digital traces of my dad that I can't yet bring myself to access. Four years after his death, I still sit with a kind of grief that remains more raw than residual, and his memory lingers in digital purgatory--undeleted yet untouched; saved but not sought. He "lives" in this liminal digital space; like a grave I can't yet bring myself to visit, but simply know is there.

google translate, kakaotalk, translation, (6 more...)

Slate

Country:

North America > United States > Missouri (0.05)
North America > United States > California (0.05)
North America > United States > Arizona (0.05)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.45)

#artificialintelligenceMar-6-2023, 09:59:38 GMT

AI translation firm unveils 'world-first' timeline to singularity

An Italian company has unveiled a novel method of measuring AI progress: analyzing improvements in machine translation. Translated, a provider of translation services, used the approach to predict when we will achieve singularity, a vague concept often defined as the point where machines become smarter than humans. The Rome-based business sets this milestone at the moment when AI provides "a perfect translation." According to the new research, this arrives when machine translation (MT) is better than top human translations. Translated's analysis suggests this will happen before the end of the 2020s.

machine translation, singularity, translation, (12 more...)

#artificialintelligence

Genre: Research Report (0.90)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)