AITopics

2408.13432

Country:

Asia > Taiwan (0.04)
North America > United States > Iowa (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

De Meulder, Maartje, Van Landuyt, Davy, Omardeen, Rehana

Lessons in co-creation: the inconvenient truths of inclusive sign language technology development

arXiv.org Artificial IntelligenceAug-23-2024

In the era of AI-driven language technologies, there is a growing demand for the participation and leadership of deaf communities in sign language technology development, often framed as co-creation. This paper, developed through collaborative and iterative dialogue between the authors with data from informal participant observations, examines the involvement of the European Union of the Deaf in two EU Horizon 2020 projects, EASIER and SignON. These projects aimed to develop mobile translation applications between signed and spoken languages, bringing together predominantly hearing, non-signing technology experts with predominantly hearing sign language academics and organizations representing deaf end users in large multi-partner consortia. While co-creation is sometimes presented as the best or required way to do research or even as emancipatory, it frequently masks systemic issues of power imbalances and tokenism. Drawing from EUD's experiences of these projects, we highlight several inconvenient truths of co-creation, and propose seven lessons for future initiatives: recognizing deaf partners' invisible labour as work, managing expectations about technologies, cripping co-creation processes, exploring alternative methods to mitigate co-creation fatigue, seeking intersectional feedback, ensuring co-creation is not just virtue signalling, and fostering deaf leadership in AI sign language research. We argue for co-creation as a transformative activity that fundamentally alters the status quo and levels the playing field. This necessitates increasing the number of deaf researchers and enhancing AI literacy among deaf communities. Without these critical transformative actions, co-creation risks merely paying lip service to deaf communities.

eud, focus group, participant, (15 more...)

2408.13171

Country:

Europe > Netherlands (0.04)
Europe > Belgium > Flanders (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(14 more...)

Genre:

Research Report (0.82)
Questionnaire & Opinion Survey (0.70)

Industry:

Health & Medicine (1.00)
Education > Curriculum > Subject-Specific Education (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)

arXiv.org Artificial IntelligenceAug-22-2024

Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation

Iyer, Vivek, Malik, Bhavitvya, Stepachev, Pavel, Chen, Pinzhen, Haddow, Barry, Birch, Alexandra

Despite the recent popularity of Large Language Models (LLMs) in Machine Translation (MT), their performance in low-resource translation still lags significantly behind Neural Machine Translation (NMT) models. In this paper, we explore what it would take to adapt LLMs for low-resource settings. In particular, we re-examine the role of two factors: a) the importance and application of parallel data, and b) diversity in Supervised Fine-Tuning (SFT). Recently, parallel data has been shown to be less important for MT using LLMs than in previous MT research. Similarly, diversity during SFT has been shown to promote significant transfer in LLMs across languages and tasks. However, for low-resource LLM-MT, we show that the opposite is true for both of these considerations: a) parallel data is critical during both pretraining and SFT, and b) diversity tends to cause interference, not transfer. Our experiments, conducted with 3 LLMs across 2 low-resourced language groups - indigenous American and North-East Indian - reveal consistent patterns in both cases, underscoring the generalizability of our findings. We believe these insights will be valuable for scaling to massively multilingual LLM-MT models that can effectively serve lower-resource languages.

parallel data, proceedings, translation, (14 more...)

2408.1278

Country:

North America > Canada > Ontario > Toronto (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
(7 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Gupta, Deepanshu, Latorre, Javier

Positional Description for Numerical Normalization

arXiv.org Artificial IntelligenceAug-22-2024

We present a Positional Description Scheme (PDS) tailored for digit sequences, integrating placeholder value information for each digit. Given the structural limitations of subword tokenization algorithms, language models encounter critical Text Normalization (TN) challenges when handling numerical tasks. Our schema addresses this challenge through straightforward pre-processing, preserving the model architecture while significantly simplifying number normalization, rendering the problem tractable. This simplifies the task and facilitates more compact production-ready models capable of learning from smaller datasets. Furthermore, our investigations reveal that PDS enhances the arithmetic processing capabilities of language models, resulting in a relative accuracy improvement of 23% to 51% on complex arithmetic tasks. We demonstrate that PDS effectively mitigates fatal numerical normalization errors in neural models, requiring only a modest amount of training data without rule-based Finite State Transducers (FST). We demonstrate that PDS is essential for both the Text-To-Speech and Speech Recognition text processing, enabling effective TN under production constraints.

dataset, normalization, text normalization, (14 more...)

2408.1243

Country:

Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Europe > Germany > Berlin (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Ali, Felermino D. M. Antonio, Cardoso, Henrique Lopes, Sousa-Silva, Rui

Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation

As part of the Open Language Data Initiative shared tasks, we have expanded the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely spoken in Mozambique. We translated the dev and devtest sets from Portuguese into Emakhuwa, and we detail the translation process and quality assurance measures used. Our methodology involved various quality checks, including post-editing and adequacy assessments. The resulting datasets consist of multiple reference sentences for each source. We present baseline results from training a Neural Machine Translation system and fine-tuning existing multilingual translation models. Our findings suggest that spelling inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline models underperformed on this evaluation set, underscoring the necessity for further research to enhance machine translation quality for Emakhuwa. The data is publicly available at https://huggingface.co/datasets/LIACC/Emakhuwa-FLORES.

artificial intelligence, machine translation, natural language, (15 more...)

2408.11457

Country:

Europe > Portugal > Porto > Porto (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Pennsylvania (0.04)
(10 more...)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Finkelstein, Mara, Vilar, David, Freitag, Markus

Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data

data outperform traditional web-crawled data, llm-generated high-quality, newspalm mbr and qe dataset

Recent research in neural machine translation (NMT) has shown that training on high-quality machine-generated data can outperform training on human-generated data. This work accompanies the first-ever release of a LLM-generated, MBR-decoded and QE-reranked dataset with both sentence-level and multi-sentence examples. We perform extensive experiments to demonstrate the quality of our dataset in terms of its downstream impact on NMT model performance. We find that training from scratch on our (machine-generated) dataset outperforms training on the (web-crawled) WMT'23 training dataset (which is 300 times larger), and also outperforms training on the top-quality subset of the WMT'23 training dataset. We also find that performing self-distillation by finetuning the LLM which generated this dataset outperforms the LLM's strong few-shot baseline. These findings corroborate the quality of our dataset, and demonstrate the value of high-quality machine-generated data in improving performance of NMT models.

2408.06537

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.87)
Information Technology > Artificial Intelligence > Representation & Reasoning > Model-Based Reasoning (0.60)

Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies

Koneru, Sai, Huck, Matthias, Exel, Miriam, Niehues, Jan

Recent advancements in NLP have resulted in models with specialized strengths, such as processing multimodal inputs or excelling in specific domains. However, real-world tasks, like multimodal translation, often require a combination of these strengths, such as handling both translation and image processing. While individual translation and vision models are powerful, they typically lack the ability to perform both tasks in a single system. Combining these models poses challenges, particularly due to differences in their vocabularies, which limit the effectiveness of traditional ensemble methods to post-generation techniques like N-best list re-ranking. In this work, we propose a novel zero-shot ensembling strategy that allows for the integration of different models during the decoding phase without the need for additional training. Our approach re-ranks beams during decoding by combining scores at the word level, using heuristics to predict when a word is completed. We demonstrate the effectiveness of this method in machine translation scenarios, showing that it enables the generation of translations that are both speech- and image-aware while also improving overall translation quality\footnote{We will release the code upon paper acceptance.}.

computational linguistic, translation, translation quality, (13 more...)

2408.11327

Country:

Asia > India > Karnataka > Bengaluru (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Cognetta, Marco, Zouhar, Vilém, Okazaki, Naoaki

Distributional Properties of Subword Regularization

Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as mentioned, we hypothesize that biasedness artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.

dropout, tokenization, tokenizer, (14 more...)

2408.11443

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Germany > Berlin (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(8 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Shahnazaryan, Lia, Beloucif, Meriem

Defining Boundaries: The Impact of Domain Specification on Cross-Language and Cross-Domain Transfer in Machine Translation

Recent advancements in neural machine translation (NMT) have revolutionized the field, yet the dependency on extensive parallel corpora limits progress for low-resource languages. Cross-lingual transfer learning offers a promising solution by utilizing data from high-resource languages but often struggles with in-domain NMT. In this paper, we investigate three pivotal aspects: enhancing the domain-specific quality of NMT by fine-tuning domain-relevant data from different language pairs, identifying which domains are transferable in zero-shot scenarios, and assessing the impact of language-specific versus domain-specific factors on adaptation effectiveness. Using English as the source language and Spanish for fine-tuning, we evaluate multiple target languages including Portuguese, Italian, French, Czech, Polish, and Greek. Our findings reveal significant improvements in domain-specific translation quality, especially in specialized fields such as medical, legal, and IT, underscoring the importance of well-defined domain data and transparency of the experiment setup in in-domain transfer learning.

computational linguistic, domain adaptation, language pair, (12 more...)

2408.11926

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Europe > Sweden > Uppsala County > Uppsala (0.04)
(11 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.94)

Industry:

Law (0.36)
Education (0.30)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Gumma, Varun, Chitale, Pranjal A., Bali, Kalika

On the Interchangeability of Positional Embeddings in Multilingual Neural Machine Translation Models

Standard Neural Machine Translation (NMT) models have traditionally been trained with Sinusoidal Positional Embeddings (PEs), which are inadequate for capturing long-range dependencies and are inefficient for long-context or document-level translation. In contrast, state-of-the-art large language models (LLMs) employ relative PEs, demonstrating superior length generalization. This work explores the potential for efficiently switching the Positional Embeddings of pre-trained NMT models from absolute sinusoidal PEs to relative approaches such as RoPE and ALiBi. Our findings reveal that sinusoidal PEs can be effectively replaced with RoPE and ALiBi with negligible or no performance loss, achieved by fine-tuning on a small fraction of high-quality data. Additionally, models trained without Positional Embeddings (NoPE) are not a viable solution for Encoder-Decoder architectures, as they consistently under-perform compared to models utilizing any form of Positional Embedding. Furthermore, even a model trained from scratch with these relative PEs slightly under-performs a fine-tuned model, underscoring the efficiency and validity of our hypothesis.

computational linguistic, positional embedding, proceedings, (13 more...)

2408.11382

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > United States > California (0.04)
(12 more...)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)