AITopics

2307.07025

Country:

Europe > Belgium > Brussels-Capital Region > Brussels (0.05)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(13 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Dinh, Tu Anh, Niehues, Jan

Perturbation-based QE: An Explainable, Unsupervised Word-level Quality Estimation Method for Blackbox Machine Translation

arXiv.org Artificial IntelligenceJul-13-2023

Quality Estimation (QE) is the task of predicting the quality of Machine Translation (MT) system output, without using any gold-standard translation references. State-of-the-art QE models are supervised: they require human-labeled quality of some MT system output on some datasets for training, making them domain-dependent and MT-system-dependent. There has been research on unsupervised QE, which requires glass-box access to the MT systems, or parallel MT data to generate synthetic errors for training QE models. In this paper, we present Perturbation-based QE - a word-level Quality Estimation approach that works simply by analyzing MT system output on perturbed input source sentences. Our approach is unsupervised, explainable, and can evaluate any type of blackbox MT systems, including the currently prominent large language models (LLMs) with opaque internal processes. For language directions with no labeled QE data, our approach has similar or better performance than the zero-shot supervised approach on the WMT21 shared task. Our approach is better at detecting gender bias and word-sense-disambiguation errors in translation than supervised QE, indicating its robustness to out-of-domain usage. The performance gap is larger when detecting errors on a nontraditional translation-prompting LLM, indicating that our approach is more generalizable to different MT systems. We give examples demonstrating our approach's explainability power, where it shows which input source words have influence on a certain MT output word.

artificial intelligence, machine translation, natural language, (14 more...)

2305.07457

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > Canada > Quebec > Montreal (0.04)
(10 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial IntelligenceJul-12-2023

Pluggable Neural Machine Translation Models via Memory-augmented Adapters

Xu, Yuzhuang, Wang, Shuo, Li, Peng, Liu, Xuebo, Wang, Xiaolong, Liu, Weidong, Liu, Yang

Although neural machine translation (NMT) models perform well in the general domain, it remains rather challenging to control their generation behavior to satisfy the requirement of different users. Given the expensive training cost and the data scarcity challenge of learning a new model from scratch for each user requirement, we propose a memory-augmented adapter to steer pretrained NMT models in a pluggable manner. Specifically, we construct a multi-granular memory based on the user-provided text samples and propose a new adapter architecture to combine the model representations and the retrieved results. We also propose a training strategy using memory dropout to reduce spurious dependencies between the NMT model and the memory. We validate our approach on both style- and domain-specific experiments and the results indicate that our method can outperform several representative pluggable baselines.

artificial intelligence, natural language, proceedings, (17 more...)

2307.06029

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial IntelligenceJul-12-2023

KIT's Multilingual Speech Translation System for IWSLT 2023

Liu, Danni, Nguyen, Thai Binh, Koneru, Sai, Ugan, Enes Yavuz, Pham, Ngoc-Quan, Nguyen, Tuan-Nam, Dinh, Tu Anh, Mullov, Carlos, Waibel, Alexander, Niehues, Jan

Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions in real-life use-cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which evaluates translation quality on scientific conference talks. The test condition features accented input speech and terminology-dense contents. The task requires translation into 10 languages of varying amounts of resources. In absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation, and show that it matches the performance of re-training. We observe that cascaded systems are more easily adaptable towards specific target domains, due to their separate modules. Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.

artificial intelligence, natural language, translation, (17 more...)

2306.0532

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
(17 more...)

Genre: Research Report (0.82)

Industry: Education (0.55)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Ward, Nigel G., Avila, Jonathan E., Rivas, Emilia, Marco, Divette

Dialogs Re-enacted Across Languages

arXiv.org Artificial IntelligenceJul-12-2023

For example, you might say: The purpose of this data collection is to further speech-to-speech translation research by creating an open collection of translated conversations, something that has not been done before. Today you will have a conversation with your partner in one language, then re-enact parts of it in another language. I will select some snippets of the audio and replay 7 them for you to translate re-record in the other language. It is important that you try your best to make it sound natural while also keeping the same feeling as in the original. Try to recreate pauses, laughs, long breaths, or anything of that sort during the second recording if possible. I can replay the audio as many times as you need, and give you as much time as you need to translate. If either of us feel like you can translate the words better or if the prosody was not as faithful in feeling as it could be, then we can redo it until we are satisfied. Please be vocal of any opinions that you have about the process and ask any questions that may arise.

artificial intelligence, machine learning, natural language, (19 more...)

2211.11584

Country:

North America > United States > Texas (0.14)
North America > Mexico > Chihuahua > Chihuahua (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(5 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Health & Medicine (1.00)
Leisure & Entertainment (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications (0.93)

Almeida, Thales Sales, Laitz, Thiago, Bonás, Giovana K., Nogueira, Rodrigo

BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams

One common trend in recent studies of language models (LMs) is the use of standardized tests for evaluation. However, despite being the fifth most spoken language worldwide, few such evaluations have been conducted in Portuguese. This is mainly due to the lack of high-quality datasets available to the community for carrying out evaluations in Portuguese. To address this gap, we introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP. The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects. Furthermore, BLUEX includes a collection of recently administered exams that are unlikely to be included in the training data of many popular LMs as of 2023. The dataset is also annotated to indicate the position of images in each question, providing a valuable resource for advancing the state-of-the-art in multimodal language understanding and reasoning. We describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing the state-of-the-art in natural language understanding and reasoning in Portuguese.

bluex, dataset, language model, (16 more...)

2307.0541

Country:

South America > Brazil (0.25)
North America > United States (0.04)
Europe > Portugal > Évora > Évora (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report (1.00)

Industry:

Education > Educational Setting (0.68)
Education > Assessment & Standards > Student Performance (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.49)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Yang, Wayne, Nicolai, Garrett

Neural Machine Translation Data Generation and Augmentation using ChatGPT

Neural models have revolutionized the field of machine translation, but creating parallel corpora is expensive and time-consuming. We investigate an alternative to manual parallel corpora - hallucinated parallel corpora created by generative language models. Although these models are themselves trained on parallel data, they can leverage a multilingual vector space to create data, and may be able to supplement small manually-procured corpora. Our experiments highlight two key findings - despite a lack of diversity in their output, the hallucinated data improves the translation signal, even when the domain clashes with the original dataset.

artificial intelligence, machine learning, natural language, (17 more...)

2307.05779

Country:

Europe > Germany > Berlin (0.05)
North America > Canada > British Columbia (0.05)
Europe > Italy > Tuscany > Florence (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Joshi, Abhinav, Agrawal, Susmit, Modi, Ashutosh

ISLTranslate: Dataset for Translating Indian Sign Language

Sign languages are the primary means of communication for many hard-of-hearing people worldwide. Recently, to bridge the communication gap between the hard-of-hearing community and the rest of the population, several sign language translation datasets have been proposed to enable the development of statistical sign language translation systems. However, there is a dearth of sign language resources for the Indian sign language. This resource paper introduces ISLTranslate, a translation dataset for continuous Indian Sign Language (ISL) consisting of 31k ISL-English sentence/phrase pairs. To the best of our knowledge, it is the largest translation dataset for continuous Indian Sign Language. We provide a detailed analysis of the dataset. To validate the performance of existing end-to-end Sign language to spoken language translation systems, we benchmark the created dataset with a transformer-based model for ISL translation.

artificial intelligence, natural language, translation, (15 more...)

2307.0544

Country:

Asia > India (0.15)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
North America > United States > Pennsylvania (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report (0.40)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Bhattacharyya, Pramit, Mondal, Joydeep, Maji, Subhadip, Bhattacharya, Arnab

Vacaspati: A Diverse Corpus of Bangla Literature

Bangla (or Bengali) is the fifth most spoken language globally; yet, the state-of-the-art NLP in Bangla is lagging for even simple tasks such as lemmatization, POS tagging, etc. This is partly due to lack of a varied quality corpus. To alleviate this need, we build Vacaspati, a diverse corpus of Bangla literature. The literary works are collected from various websites; only those works that are publicly available without copyright violations or restrictions are collected. We believe that published literature captures the features of a language much better than newspapers, blogs or social media posts which tend to follow only a certain literary pattern and, therefore, miss out on language variety. Our corpus Vacaspati is varied from multiple aspects, including type of composition, topic, author, time, space, etc. It contains more than 11 million sentences and 115 million words. We also built a word embedding model, Vac-FT, using FastText from Vacaspati as well as trained an Electra model, Vac-BERT, using the corpus. Vac-BERT has far fewer parameters and requires only a fraction of resources compared to other state-of-the-art transformer models and yet performs either better or similar on various downstream tasks. On multiple downstream tasks, Vac-FT outperforms other FastText-based models. We also demonstrate the efficacy of Vacaspati as a corpus by showing that similar models built from other corpora are not as effective. The models are available at https://bangla.iitk.ac.in/.

corpus, machine learning, natural language, (22 more...)

2307.05083

Country:

Asia > India > West Bengal > Kolkata (0.14)
Asia > Bangladesh (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
(15 more...)

Genre: Research Report (0.82)

Industry:

Media (0.35)
Law > Intellectual Property & Technology Law (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Communications > Social Media (0.91)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
(2 more...)

Faithful Low-Resource Data-to-Text Generation through Cycle Training

Wang, Zhuoer, Collins, Marcus, Vedula, Nikhita, Filice, Simone, Malmasi, Shervin, Rokhlenko, Oleg

Methods to generate text from structured data have advanced significantly in recent years, primarily due to fine-tuning of pre-trained language models on large datasets. However, such models can fail to produce output faithful to the input data, particularly on out-of-domain data. Sufficient annotated data is often not available for specific domains, leading us to seek an unsupervised approach to improve the faithfulness of output text. Since the problem is fundamentally one of consistency between the representations of the structured data and text, we evaluate the effectiveness of cycle training in this work. Cycle training uses two models which are inverses of each other: one that generates text from structured data, and one which generates the structured data from natural language text. We show that cycle training, when initialized with a small amount of supervised data (100 samples in our case), achieves nearly the same performance as fully supervised approaches for the data-to-text generation task on the WebNLG, E2E, WTQ, and WSQL datasets. We perform extensive empirical analysis with automated evaluation metrics and a newly designed human evaluation schema to reveal different cycle training strategies' effectiveness of reducing various types of generation errors. Our code is publicly available at https://github.com/Edillower/CycleNLG.

information retrieval, machine learning, natural language, (21 more...)

2305.14793

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Israel (0.08)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.05)
(18 more...)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment (0.69)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)
(2 more...)