AITopics

doi: 10.1145/3613904.3642605

2405.16669

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Africa > Kenya > Nairobi City County > Nairobi (0.04)
(26 more...)

Genre:

Questionnaire & Opinion Survey (0.92)
Research Report > New Finding (0.68)

Industry:

Health & Medicine (1.00)
Information Technology (0.92)
Government (0.67)
Education > Educational Setting (0.46)

Technology:

Information Technology > Knowledge Management (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial Intelligence May-26-2024

Crossmodal ASR Error Correction with Discrete Speech Units

Li, Yuanchao, Chen, Pinzhen, Bell, Peter, Lai, Catherine

ASR remains unsatisfactory in scenarios where the speaking style diverges from that used to train ASR systems, resulting in erroneous transcripts. To address this, ASR Error Correction (AEC), a post-ASR processing approach, is required. In this work, we tackle an understudied issue: the Low-Resource Out-of-Domain (LROOD) problem, by investigating crossmodal AEC on very limited downstream data with 1-best hypothesis transcription. We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon, shedding light on appropriate training schemes for LROOD data. Moreover, we propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality. Results from multiple corpora and several evaluation metrics demonstrate the feasibility and efficacy of our proposed AEC approach on LROOD data, as well as its generalizability and superiority on large-scale data. Finally, a study on speech emotion recognition confirms that our model produces ASR error-robust transcripts suitable for downstream applications.

hypothesis, information, transcript, (12 more...)

2405.16677

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Italy > Trentino-Alto Adige/Südtirol > Trentino Province > Trento (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

arXiv.org Artificial Intelligence May-24-2024

Text Generation: A Systematic Literature Review of Tasks, Evaluation, and Challenges

Becker, Jonas, Wahle, Jan Philip, Gipp, Bela, Ruas, Terry

Text generation has become more accessible than ever, and the increasing interest in these systems, especially those using large language models, has spurred a growing number of related publications. We provide a systematic literature review comprising 244 selected papers between 2017 and 2024. This review categorizes works in text generation into five main tasks: open-ended text generation, summarization, translation, paraphrasing, and question answering. For each task, we review its relevant characteristics, sub-tasks, and specific challenges (e.g., missing datasets for multi-document summarization, coherence in story generation, and complex reasoning for question answering). Additionally, we assess current approaches for evaluating text generation systems and ascertain problems with current metrics. Our investigation shows nine prominent challenges common to all tasks and sub-tasks in recent text generation publications: bias, reasoning, hallucinations, misuse, privacy, interpretability, transparency, datasets, and computing. We provide a detailed analysis of these challenges, their potential solutions, and which gaps still require further engagement from the community. This systematic literature review targets two main audiences: early career researchers in natural language processing looking for an overview of the field and promising research directions, as well as experienced researchers seeking a detailed view of tasks, evaluation methodologies, open challenges, and recent mitigation strategies.

computational linguistic, proceedings, text generation, (10 more...)

2405.15604

Country:

Europe > Germany > Lower Saxony > Gottingen (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
(34 more...)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.87)

Industry:

Information Technology > Security & Privacy (1.00)
Education (0.68)
Media (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (1.00)
(4 more...)

Bouthors, Maxime, Crego, Josep, Yvon, François

Optimizing example selection for retrieval-augmented machine translation with translation memories

Retrieval-augmented machine translation leverages examples from a translation memory by retrieving similar instances. These examples are used to condition the predictions of a neural decoder. We aim to improve the upstream retrieval step and consider a fixed downstream edit-based model: the multi-Levenshtein Transformer. The task consists of finding a set of examples that maximizes the overall coverage of the source sentence. To this end, we rely on the theory of submodular functions and explore new algorithms to optimize this coverage. We evaluate the resulting performance gains for the machine translation task.
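
As a rough illustration of the coverage objective described in this abstract, the sketch below greedily picks translation-memory examples that cover the most source tokens not yet covered. Token-set coverage, the function name `greedy_cover`, and the toy sentences are assumptions made for illustration; the abstract does not specify the paper's coverage functions or algorithms. Greedy selection is shown only because it is the usual baseline for maximizing a monotone submodular function under a size limit.

```python
def greedy_cover(source_tokens, candidates, k):
    """Pick up to k candidates, each time taking the one that covers the most
    source tokens not yet covered (ties go to the earliest candidate)."""
    uncovered = set(source_tokens)
    selected = []
    for _ in range(k):
        best_idx, best_gain = None, 0
        for idx, cand in enumerate(candidates):
            if idx in selected:
                continue
            gain = len(uncovered & set(cand))  # marginal coverage gain
            if gain > best_gain:
                best_idx, best_gain = idx, gain
        if best_idx is None:  # no remaining candidate adds coverage
            break
        selected.append(best_idx)
        uncovered -= set(candidates[best_idx])
    return selected, uncovered


# Toy data (hypothetical): one source sentence and a three-entry memory.
source = "the cat sat on the mat".split()
memory = [
    "the cat sleeps".split(),
    "a dog sat on the mat".split(),
    "the cat sat".split(),
]
chosen, missing = greedy_cover(source, memory, k=2)
```

Here `chosen` holds the indices of the selected memory entries and `missing` the source tokens left uncovered. Each round's gain can only shrink as more tokens are covered, which is the diminishing-returns property that makes this kind of objective submodular.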

computational linguistic, machine translation, translation, (12 more...)

2405.1507

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Singapore (0.04)
North America > United States > Maryland > Baltimore (0.04)
(15 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Boughorbel, Sabri, Parvez, MD Rizwan, Hawasly, Majd

Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis

Training LLMs in low-resource languages usually relies on data augmentation with machine translation (MT) from English. However, translation brings a number of challenges: there are large costs attached to translating and curating huge amounts of content with high-end machine translation solutions, the translated content carries over cultural biases, and if the translation is not faithful and accurate, the quality of the data degrades, causing issues in the trained model. In this work we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the free NLLB-3B MT model. We train a number of story generation models of sizes 1M-33M parameters using this data. We identify a number of quality and task-specific issues in the resulting models. To rectify these issues, we further pre-train the models with a small dataset of synthesized high-quality stories, representing 1% of the original training data, using a capable LLM in Arabic. We show using GPT-4 as a judge and dictionary learning analysis from mechanistic interpretability that the suggested approach is a practical means to resolve some of the translation pitfalls. We illustrate the improvement through case studies of linguistic issues and cultural bias.

language model, tinystory, translation, (13 more...)

2405.14277

Country:

Europe > Italy > Tuscany > Florence (0.04)
Europe > Faroe Islands > Streymoy > Tórshavn (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

García-Romero, Cristian, Esplà-Gomis, Miquel, Sánchez-Martínez, Felipe

Smart Bilingual Focused Crawling of Parallel Documents

The availability of large text corpora is especially relevant in the field of machine translation, where the state-of-the-art approach to neural machine translation (Vaswani et al., 2017) requires large amounts of parallel texts, i.e., texts in one language and their translation into another language. Parallel texts have also proven useful to build pre-trained language models with cross-lingual capabilities (Conneau et al., 2020; Kale et al., 2021; Reid and Artetxe, 2022), and in translation-memory tools (Bowker, 2002) to assist professional translators. The reduced availability of parallel documents, particularly for low-resource language pairs, is fuelling a growing interest in web mining, which has made it possible to build some of the largest parallel corpora to date (El-Kishky et al., 2020; Bañón et al., 2020; Schwenk et al., 2021; Bañón et al., 2022). State-of-the-art tools for harvesting parallel data from the Internet, like Bitextor (Bañón et al., 2020; Esplà-Gomis et al., 2016) and ILSP-FocusedCrawler (Papavassiliou et al., 2018), use a web crawler to automatically browse the web and collect textual data. Web crawlers start with a list of seed URLs. The corresponding documents are downloaded and parsed, and any new URLs linked from them are added to a list of pending downloads.
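
The crawl loop described at the end of this excerpt (seed URLs, download and parse, queue newly found links) can be sketched as follows. This is a minimal illustration, not the crawler used by Bitextor or ILSP-FocusedCrawler: the function `crawl`, the `fetch_links` callback, and the in-memory `TOY_WEB` graph with its example.org URLs are all hypothetical stand-ins so the example runs without network access.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: visit a page, then queue any links not seen before."""
    pending = deque(seeds)   # the list of pending downloads
    seen = set(seeds)        # every URL ever queued, to avoid revisits
    visited = []
    while pending and len(visited) < max_pages:
        url = pending.popleft()
        visited.append(url)
        for link in fetch_links(url):  # stands in for "download and parse"
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return visited


# Toy link graph standing in for the web (hypothetical URLs).
TOY_WEB = {
    "https://example.org/": ["https://example.org/en", "https://example.org/es"],
    "https://example.org/en": ["https://example.org/es", "https://example.org/"],
    "https://example.org/es": [],
}
order = crawl(["https://example.org/"], lambda url: TOY_WEB.get(url, []))
```

A focused crawler would typically replace the first-in, first-out queue with some priority ordering over pending URLs; how the paper scores URLs is not described in this excerpt.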

computational linguistic, parallel document, proceedings, (15 more...)

2405.14779

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Berlin (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(15 more...)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Exploring Alignment in Shared Cross-lingual Spaces

Mousi, Basel, Durrani, Nadir, Dalvi, Fahim, Hawasly, Majd, Abdelali, Ahmed

Despite their remarkable ability to capture linguistic nuances across diverse languages, questions persist regarding the degree of alignment between languages in multilingual embeddings. Drawing inspiration from research on high-dimensional representations in neural language models, we employ clustering to uncover latent concepts within multilingual models. Our analysis focuses on quantifying the alignment and overlap of these concepts across various languages within the latent space. To this end, we introduce two metrics \CA{} and \CO{} aimed at quantifying these aspects, enabling a deeper exploration of multilingual embeddings. Our study encompasses three multilingual models (mT5, mBERT, and XLM-R) and three downstream tasks (Machine Translation, Named Entity Recognition, and Sentiment Analysis). Key findings from our analysis include: i) deeper layers in the network demonstrate increased cross-lingual alignment due to the presence of language-agnostic concepts, ii) fine-tuning of the models enhances alignment within the latent space, and iii) such task-specific calibration helps in explaining the emergence of zero-shot capabilities in the models. The code is available at https://github.com/baselmousi/multilingual-latent-concepts

alignment, computational linguistic, latent space, (14 more...)

2405.14535

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Singapore (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(11 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.67)

Improving Gloss-free Sign Language Translation by Reducing Representation Density

Ye, Jinhui, Wang, Xing, Jiao, Wenxiang, Liang, Junwei, Xiong, Hui

Gloss-free sign language translation (SLT) aims to develop well-performing SLT systems with no requirement for the costly gloss annotations, but currently still lags behind gloss-based approaches significantly. In this paper, we identify a representation density problem that could be a bottleneck in restricting the performance of gloss-free SLT. Specifically, the representation density problem refers to the tendency of the visual representations of semantically distinct sign gestures to be closely packed together in feature space, which makes gloss-free methods struggle with distinguishing different sign gestures and suffer from a sharp performance drop. To address the representation density problem, we introduce a simple but effective contrastive learning strategy, namely SignCL, which encourages gloss-free models to learn more discriminative feature representations in a self-supervised manner. Our experiments demonstrate that the proposed SignCL can significantly reduce the representation density and improve performance across various translation frameworks. Specifically, SignCL achieves a significant improvement in BLEU score for the Sign Language Transformer and GFSLT-VLP on the CSL-Daily dataset by 39% and 46%, respectively, without any increase of model parameters. Compared to Sign2GPT, a state-of-the-art method based on large-scale pre-trained vision and language models, SignCL achieves better performance with only 35% of its parameters.

language translation, representation density, translation, (13 more...)

2405.14312

Country:

Asia > China > Guangdong Province > Guangzhou (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Education > Curriculum > Subject-Specific Education (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Instruction Tuning With Loss Over Instructions

Shi, Zhengyan, Yang, Adam X., Wu, Bin, Aitchison, Laurence, Yilmaz, Emine, Lipani, Aldo

Instruction tuning plays a crucial role in shaping the outputs of language models (LMs) to desired styles. In this work, we propose a simple yet effective method, Instruction Modelling (IM), which trains LMs by applying a loss function to the instruction and prompt part rather than solely to the output part. Through experiments across 21 diverse benchmarks, we show that, in many scenarios, IM can effectively improve the LM performance on both NLP tasks (e.g., MMLU, TruthfulQA, and HumanEval) and open-ended generation benchmarks (e.g., MT-Bench and AlpacaEval). Remarkably, in the most advantageous case, IM boosts model performance on AlpacaEval 1.0 by over 100%. We identify two key factors influencing the effectiveness of IM: (1) the ratio between instruction length and output length in the training data; and (2) the number of training examples. We observe that IM is especially beneficial when trained on datasets with lengthy instructions paired with brief outputs, or under the Superficial Alignment Hypothesis (SAH), where a small number of training examples is used for instruction tuning. Further analysis substantiates our hypothesis that the improvement can be attributed to reduced overfitting to instruction tuning datasets. Our work provides practical guidance for instruction tuning LMs, especially in low-resource scenarios.
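
To make the difference between output-only loss and loss over instructions concrete, the sketch below averages per-token negative log-likelihoods with and without the instruction positions. It is a simplified illustration under stated assumptions: the function `masked_nll`, the per-token log-probabilities, and the instruction mask are invented for the example, and the abstract does not say how the paper normalizes the loss or treats template tokens.

```python
def masked_nll(token_logprobs, is_instruction, include_instruction):
    """Average negative log-likelihood over the positions that count toward the loss."""
    kept = [
        -lp
        for lp, instr in zip(token_logprobs, is_instruction)
        if include_instruction or not instr
    ]
    return sum(kept) / len(kept)


# Hypothetical per-token log-probabilities for one training example:
# three instruction tokens followed by two output tokens.
logprobs = [-2.0, -1.0, -3.0, -0.5, -1.5]
is_instruction = [True, True, True, False, False]

# Conventional instruction tuning: only output tokens contribute.
output_only = masked_nll(logprobs, is_instruction, include_instruction=False)
# Loss over instructions as well: every token contributes.
with_instruction = masked_nll(logprobs, is_instruction, include_instruction=True)
```

With a long instruction and a short output, as in this toy example, most of the training signal in the second variant comes from instruction tokens, which is the regime the abstract reports as most favorable.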

computational linguistic, dataset, instruction, (15 more...)

2405.14394

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(9 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
(2 more...)

arXiv.org Artificial Intelligence May-22-2024

A Survey on Multi-modal Machine Translation: Tasks, Methods and Challenges

Shen, Huangjun, Shao, Liangying, Li, Wenbo, Lan, Zhibin, Liu, Zhanyu, Su, Jinsong

In recent years, multi-modal machine translation has attracted significant interest in both academia and industry due to its superior performance. It takes both textual and visual modalities as inputs, leveraging visual context to tackle the ambiguities in source texts. In this paper, we begin by offering an exhaustive overview of 99 prior works, comprehensively summarizing representative studies from the perspectives of dominant models, datasets, and evaluation metrics. Afterwards, we analyze the impact of various factors on model performance and finally discuss possible future research directions for this task. Over time, multi-modal machine translation has developed more types to meet diverse needs. Unlike previous surveys confined to the early stage of multi-modal machine translation, our survey thoroughly summarizes these emerging types from different aspects, so as to provide researchers with a better understanding of its current state.

machine translation, proceedings, translation, (12 more...)

2405.12669

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
(43 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)