AITopics

2407.10456

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.05)
North America > Canada > Ontario > Toronto (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(9 more...)

Genre: Research Report (1.00)

Industry: Education (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceJul-15-2024

Word Order in English-Japanese Simultaneous Interpretation: Analyses and Evaluation using Chunk-wise Monotonic Translation

Doi, Kosuke, Ko, Yuka, Makinae, Mana, Sudoh, Katsuhito, Nakamura, Satoshi

This paper analyzes the features of monotonic translations, which follow the word order of the source language, in simultaneous interpreting (SI). Word order differences are one of the biggest challenges in SI, especially for language pairs with significant structural differences like English and Japanese. We analyzed the characteristics of chunk-wise monotonic translation (CMT) sentences using the NAIST English-to-Japanese Chunk-wise Monotonic Translation Evaluation Dataset and identified some grammatical structures that make monotonic translation difficult in English-Japanese SI. We further investigated the features of CMT sentences by evaluating the output from the existing speech translation (ST) and simultaneous speech translation (simulST) models on the NAIST English-to-Japanese Chunk-wise Monotonic Translation Evaluation Dataset as well as on existing test sets. The results indicate the possibility that the existing SI-based test set underestimates the model performance. The results also suggest that using CMT sentences as references gives higher scores to simulST models than ST models, and that using an offline-based test set to evaluate the simulST models underestimates the model performance.

cmt sentence, proceedings, translation, (14 more...)

2406.0894

Country:

Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Spain (0.04)
(18 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Kuwanto, Garry, Urua, Eno-Abasi E., Amuok, Priscilla Amondi, Muhammad, Shamsuddeen Hassan, Aremu, Anuoluwapo, Otiende, Verrah, Nanyanga, Loice Emma, Nyoike, Teresiah W., Akpan, Aniefon D., Udouboh, Nsima Ab, Archibong, Idongesit Udeme, Moses, Idara Effiong, Ige, Ifeoluwatayo A., Ajibade, Benjamin, Awokoya, Olumide Benjamin, Abdulmumin, Idris, Aliyu, Saminu Mohammad, Iro, Ruqayya Nasir, Ahmad, Ibrahim Said, Smith, Deontae, Michaels, Praise-EL, Adelani, David Ifeoluwa, Wijaya, Derry Tanti, Andy, Anietie

Mitigating Translationese in Low-resource Languages: The Storyboard Approach

arXiv.org Artificial IntelligenceJul-14-2024

Low-resource languages often face challenges in acquiring high-quality language data due to the reliance on translation-based methods, which can introduce the translationese effect. This phenomenon results in translated sentences that lack fluency and naturalness in the target language. In this paper, we propose a novel approach for data collection by leveraging storyboards to elicit more fluent and natural sentences. Our method involves presenting native speakers with visual stimuli in the form of storyboards and collecting their descriptions without direct exposure to the source text. We conducted a comprehensive evaluation comparing our storyboard-based approach with traditional text translation-based methods in terms of accuracy and fluency. Human annotators and quantitative metrics were used to assess translation quality. The results indicate a preference for text translation in terms of accuracy, while our method demonstrates worse accuracy but better fluency in the language focused.

english sentence, text translation, translation, (13 more...)

2407.10152

Country:

North America > United States (0.70)
Europe > Spain (0.04)
Europe > Latvia > Riga Municipality > Riga (0.04)
(13 more...)

Genre: Research Report > Experimental Study (0.47)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)

Bakagianni, Juli, Pouli, Kanella, Gavriilidou, Maria, Pavlopoulos, John

Towards Systematic Monolingual NLP Surveys: GenA of Greek NLP

arXiv.org Artificial IntelligenceJul-13-2024

Natural Language Processing (NLP) research has traditionally been predominantly focused on English, driven by the availability of resources, the size of the research community, and market demands. Recently, there has been a noticeable shift towards multilingualism in NLP, recognizing the need for inclusivity and effectiveness across diverse languages and cultures. Monolingual surveys have the potential to complement the broader trend towards multilingualism in NLP by providing foundational insights and resources necessary for effectively addressing the linguistic diversity of global communication. However, monolingual NLP surveys are extremely rare in literature. This study fills the gap by introducing a method for creating systematic and comprehensive monolingual NLP surveys. Characterized by a structured search protocol, it can be used to select publications and organize them through a taxonomy of NLP tasks. We include a classification of Language Resources (LRs), according to their availability, and datasets, according to their annotation, to highlight publicly-available and machine-actionable LRs. By applying our method, we conducted a systematic literature review of Greek NLP from 2012 to 2022, providing a comprehensive overview of the current state and challenges of Greek NLP research. We discuss the progress of Greek NLP and outline encountered Greek LRs, classified by availability and usability. As we show, our proposed method helps avoid common pitfalls, such as data leakage and contamination, and to assess language support per NLP task. We consider this systematic literature review of Greek NLP an application of our method that showcases the benefits of a monolingual NLP survey. Similar applications could be regard the myriads of languages whose progress in NLP lags behind that of well-supported languages.

author availability ann, modern greek, nlp application, (16 more...)

2407.09861

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Europe > Slovenia (0.04)
(23 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Law (1.00)
Information Technology (1.00)
Education > Curriculum > Subject-Specific Education (0.92)
(7 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(8 more...)

Paniv, Yurii, Chaplynskyi, Dmytro, Trynus, Nikita, Kyrylov, Volodymyr

Setting up the Data Printer with Improved English to Ukrainian Machine Translation

To build large language models for Ukrainian we need to expand our corpora with large amounts of new algorithmic tasks expressed in natural language. Examples of task performance expressed in English are abundant, so with a high-quality translation system our community will be enabled to curate datasets faster. To aid this goal, we introduce a recipe to build a translation system using supervised finetuning of a large pretrained language model with a noisy parallel dataset of 3M pairs of Ukrainian and English sentences followed by a second phase of training using 17K examples selected by k-fold perplexity filtering on another dataset of higher quality. Our decoder-only model named Dragoman beats performance of previous state of the art encoder-decoder models on the FLORES devtest set.

large language model, natural language, translation, (16 more...)

2404.15196

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.05)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)
Oceania > Australia > New South Wales (0.04)
(12 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Towards Chapter-to-Chapter Context-Aware Literary Translation via Large Language Models

Jin, Linghao, An, Li, Ma, Xuezhe

Discourse phenomena in existing document-level translation datasets are sparse, which has been a fundamental obstacle in the development of context-aware machine translation models. Moreover, most existing document-level corpora and context-aware machine translation methods rely on an unrealistic assumption on sentence-level alignments. To mitigate these issues, we first curate a novel dataset of Chinese-English literature, which consists of 160 books with intricate discourse structures. Then, we propose a more pragmatic and challenging setting for context-aware translation, termed chapter-to-chapter (Ch2Ch) translation, and investigate the performance of commonly-used machine translation models under this setting. Furthermore, we introduce a potential approach of finetuning large language models (LLMs) within the domain of Ch2Ch literary translation, yielding impressive improvements over baselines. Through our comprehensive analysis, we unveil that literary translation under the Ch2Ch setting is challenging in nature, with respect to both model learning methods and translation decoding algorithms.

computational linguistic, dataset, translation, (13 more...)

2407.08978

Country:

North America > United States > California (0.14)
Asia > Singapore (0.04)
South America > Paraguay > Asunción > Asunción (0.04)
(17 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Fayyazsanavi, Pooya, Anastasopoulos, Antonios, Košecká, Jana

Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing

Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on Gloss2Text translation stage and propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and novel label-smoothing loss function exploiting gloss translation ambiguities Figure 1: An example of ambiguity in sign language is improving significantly the performance of demonstrated by the gloss "BEWOELKT (CLOUDY)," state-of-the-art approaches. Through extensive which is represented in multiple translations within the experiments and ablation studies on the dataset. As shown, ambiguity may share the same meaning PHOENIX Weather 2014T dataset, our approach but differ in form, such as "wolken (cloudy)," or surpasses state-of-the-art performance where the gloss represents the concept meaning, such in Gloss2Text translation, indicating its efficacy as "unbeständig (unstable)." in addressing sign language translation and suggesting promising avenues for future research and development.

language translation, translation, translation task, (12 more...)

2407.01394

Genre: Research Report > Promising Solution (0.48)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs

Heakl, Ahmed, Zaghloul, Youssef, Ali, Mennatullah, Hossam, Rania, Gomaa, Walid

Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. In the field of ASR, we explore the utilization of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics showcases promising results, with our methodologies yielding a significant improvement of $56\%$ in English translation over the state-of-the-art and $9.3\%$ in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon. This capability is crucial for enabling seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources. Code: \url{http://github.com/ahmedheakl/arazn-llm}}, Models: \url{http://huggingface.co/collections/ahmedheakl/arazn-llm-662ceaf12777656607b9524e}.

egyptian arabic, machine translation, translation, (11 more...)

2406.1812

Country:

Africa > Middle East > Egypt > Alexandria Governorate > Alexandria (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)

Elmadany, AbdelRahim, Adebara, Ife, Abdul-Mageed, Muhammad

Toucan: Many-to-Many Translation for 150 African Language Pairs

We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, we introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters respectively. Next, we finetune the aforementioned models to create toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we carefully develop an extensive machine translation benchmark, dubbed AfroLingu-MT, tailored for evaluating machine translation. Toucan significantly outperforms other models, showcasing its remarkable performance on MT for African languages. Finally, we train a new model, spBLEU-1K, to enhance translation evaluation metrics, covering 1K languages, including 614 African languages. This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa. The GitHub repository for the Toucan project is available at https://github.com/UBC-NLP/Toucan.

african language, dataset, translation, (14 more...)

2407.04796

Country:

Africa > Niger (0.07)
Africa > Ethiopia (0.05)
Africa > Uganda (0.05)
(49 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Frontull, Samuel, Moser, Georg

Rule-Based, Neural and LLM Back-Translation: Comparative Insights from a Variant of Ladin

arXiv.org Artificial IntelligenceJul-11-2024

This paper explores the impact of different back-translation approaches on machine translation for Ladin, specifically the Val Badia variant. Given the limited amount of parallel data available for this language (only 18k Ladin-Italian sentence pairs), we investigate the performance of a multilingual neural machine translation model fine-tuned for Ladin-Italian. In addition to the available authentic data, we synthesise further translations by using three different models: a fine-tuned neural model, a rule-based system developed specifically for this language pair, and a large language model. Our experiments show that all approaches achieve comparable translation quality in this low-resource scenario, yet round-trip translations highlight differences in model performance.

ladin, monolingual bt, translation, (12 more...)

2407.08819

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Portugal > Lisbon > Lisbon (0.14)
Europe > Finland > Uusimaa > Helsinki (0.05)
(15 more...)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)