AITopics

2411.08889

Country: North America > United States > Virginia > Montgomery County > Blacksburg (0.05)

Genre: Research Report (0.82)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Information Technology > Services (0.71)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)

TehraniJamsaz, Ali, Bhattacharjee, Arijit, Chen, Le, Ahmed, Nesreen K., Yazdanbakhsh, Amir, Jannesari, Ali

CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming

arXiv.org Artificial IntelligenceOct-27-2024

Recent advancements in Large Language Models (LLMs) have renewed interest in automatic programming language translation. Encoder-decoder transformer models, in particular, have shown promise in translating between different programming languages. However, translating between a language and its high-performance computing (HPC) extensions remains underexplored due to challenges such as complex parallel semantics. In this paper, we introduce CodeRosetta, an encoder-decoder transformer model designed specifically for translating between programming languages and their HPC extensions. CodeRosetta is evaluated on C++ to CUDA and Fortran to C++ translation tasks. It uses a customized learning framework with tailored pretraining and training objectives to effectively capture both code semantics and parallel structural nuances, enabling bidirectional translation. Our results show that CodeRosetta outperforms state-of-the-art baselines in C++ to CUDA translation by 2.9 BLEU and 1.72 CodeBLEU points while improving compilation accuracy by 6.05%. Compared to general closed-source LLMs, our method improves C++ to CUDA translation by 22.08 BLEU and 14.39 CodeBLEU, with 2.75% higher compilation accuracy. Finally, CodeRosetta exhibits proficiency in Fortran to parallel C++ translation, marking it, to our knowledge, as the first encoder-decoder model for this complex task, improving CodeBLEU by at least 4.63 points compared to closed-source and open-code LLMs.

large language model, machine learning, programming language, (21 more...)

2410.20527

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
North America > United States > Iowa > Story County > Ames (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(2 more...)

Building Dialogue Understanding Models for Low-resource Language Indonesian from Scratch

Di, Donglin, Zhang, Weinan, Zhang, Yue, Wang, Fanglin

Making use of off-the-shelf resources of resource-rich languages to transfer knowledge for low-resource languages raises much attention recently. The requirements of enabling the model to reach the reliable performance lack well guided, such as the scale of required annotated data or the effective framework. To investigate the first question, we empirically investigate the cost-effectiveness of several methods to train the intent classification and slot-filling models for Indonesia (ID) from scratch by utilizing the English data. Confronting the second challenge, we propose a Bi-Confidence-Frequency Cross-Lingual transfer framework (BiCF), composed by ``BiCF Mixing'', ``Latent Space Refinement'' and ``Joint Decoder'', respectively, to tackle the obstacle of lacking low-resource language dialogue data. Extensive experiments demonstrate our framework performs reliably and cost-efficiently on different scales of manually annotated Indonesian data. We release a large-scale fine-labeled dialogue dataset (ID-WOZ) and ID-BERT of Indonesian for further research.

artificial intelligence, machine learning, natural language, (15 more...)

2410.1843

Country:

Asia > Indonesia (0.24)
Asia > China > Heilongjiang Province > Harbin (0.04)
North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs

Zhang, Ran, Zhao, Wei, Eger, Steffen

Recent research has focused on literary machine translation (MT) as a new challenge in MT. However, the evaluation of literary MT remains an open problem. We contribute to this ongoing discussion by introducing LITEVAL-CORPUS, a paragraph-level parallel corpus comprising multiple verified human translations and outputs from 9 MT systems, which totals over 2k paragraphs and includes 13k annotated sentences across four language pairs, costing 4.5k Euro. This corpus enables us to (i) examine the consistency and adequacy of multiple annotation schemes, (ii) compare evaluations by students and professionals, and (iii) assess the effectiveness of LLM-based metrics. We find that Multidimensional Quality Metrics (MQM), as the de facto standard in non-literary human MT evaluation, is inadequate for literary translation: While Best-Worst Scaling (BWS) with students and Scalar Quality Metric (SQM) with professional translators prefer human translations at rates of ~82% and ~94%, respectively, MQM with student annotators prefers human professional translations over the translations of the best-performing LLMs in only ~42% of cases. While automatic metrics generally show a moderate correlation with human MQM and SQM, they struggle to accurately identify human translations, with rates of at most ~20%. Our overall evaluation indicates that human professional translations consistently outperform LLM translations, where even the most recent LLMs tend to produce more literal and less diverse translations compared to human translations. However, newer LLMs such as GPT-4o perform substantially better than older ones.

large language model, machine learning, translation, (19 more...)

2410.18697

Country:

Asia > Singapore (0.04)
Europe > United Kingdom > England > South Yorkshire > Sheffield (0.04)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
(11 more...)

Genre: Research Report > New Finding (0.67)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Nachesa, Maya K., Niculae, Vlad

We Augmented Whisper With kNN and You Won't Believe What Came Next

Speech recognition performance varies by language, domain, and speaker characteristics such as accent, and fine-tuning a model on any of these categories may lead to catastrophic forgetting. $k$ nearest neighbor search ($k$NN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that can instead adapt by building an external datastore that can then be searched during inference time, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.

datastore, machine learning, natural language, (20 more...)

2410.1885

Country:

Europe > Belgium (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Ramos, Rita, Chimoto, Everlyn Asiko, ter Hoeve, Maartje, Schluter, Natalie

GrammaMT: Improving Machine Translation with Grammar-Informed In-Context Learning

We introduce GrammaMT, a grammatically-aware prompting approach for machine translation that uses Interlinear Glossed Text (IGT), a common form of linguistic description providing morphological and lexical annotations for source sentences. GrammaMT proposes three prompting strategies: gloss-shot, chain-gloss and model-gloss. All are training-free, requiring only a few examples that involve minimal effort to collect, and making them well-suited for low-resource setups. Experiments show that GrammaMT enhances translation performance on open-source instruction-tuned LLMs for various low- to high-resource languages across three benchmarks: (1) the largest IGT corpus, (2) the challenging 2023 SIGMORPHON Shared Task data over endangered languages, and (3) even in an out-of-domain setting with FLORES. Moreover, ablation studies reveal that leveraging gloss resources could substantially boost MT performance (by over 17 BLEU points) if LLMs accurately generate or access input sentence glosses.

large language model, machine learning, translation, (18 more...)

2410.18702

Country:

Asia > Singapore (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)
(3 more...)

Genre:

Research Report > New Finding (0.72)
Research Report > Experimental Study (0.51)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Wanzare, Lilian, Okutoyi, Joel, Kang'ahi, Maurine, Ayere, Mildred

Kenyan Sign Language (KSL) Dataset: Using Artificial Intelligence (AI) in Bridging Communication Barrier among the Deaf Learners

arXiv.org Artificial IntelligenceOct-23-2024

Kenyan Sign Language (KSL) is the primary language used by the deaf community in Kenya. It is the medium of instruction from Pre-primary 1 to university among deaf learners, facilitating their education and academic achievement. Kenyan Sign Language is used for social interaction, expression of needs, making requests and general communication among persons who are deaf in Kenya. However, there exists a language barrier between the deaf and the hearing people in Kenya. Thus, the innovation on AI4KSL is key in eliminating the communication barrier. Artificial intelligence for KSL is a two-year research project (2023-2024) that aims to create a digital open-access AI of spontaneous and elicited data from a representative sample of the Kenyan deaf community. The purpose of this study is to develop AI assistive technology dataset that translates English to KSL as a way of fostering inclusion and bridging language barriers among deaf learners in Kenya. Specific objectives are: Build KSL dataset for spoken English and video recorded Kenyan Sign Language and to build transcriptions of the KSL signs to a phonetic-level interface of the sign language. In this paper, the methodology for building the dataset is described. Data was collected from 48 teachers and tutors of the deaf learners and 400 learners who are Deaf. Participants engaged mainly in sign language elicitation tasks through reading and singing. Findings of the dataset consisted of about 14,000 English sentences with corresponding KSL Gloss derived from a pool of about 4000 words and about 20,000 signed KSL videos that are either signed words or sentences. The second level of data outcomes consisted of 10,000 split and segmented KSL videos. The third outcome of the dataset consists of 4,000 transcribed words into five articulatory parameters according to HamNoSys system.

artificial intelligence, machine translation, natural language, (13 more...)

2410.18295

Country:

Asia > Pakistan (0.04)
North America > United States > Hawaii (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(8 more...)

Genre:

Overview (0.68)
Research Report (0.64)
Instructional Material (0.46)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Wang, Xinyu, Zhang, Wenbo, Rajtmajer, Sarah

Monolingual and Multilingual Misinformation Detection for Low-Resource Languages: A Comprehensive Survey

arXiv.org Artificial IntelligenceOct-23-2024

In today's global digital landscape, misinformation transcends linguistic boundaries, posing a significant challenge for moderation systems. While significant advances have been made in misinformation detection, the focus remains largely on monolingual high-resource contexts, with low-resource languages often overlooked. This survey aims to bridge that gap by providing a comprehensive overview of the current research on low-resource language misinformation detection in both monolingual and multilingual settings. We review the existing datasets, methodologies, and tools used in these domains, identifying key challenges related to: data resources, model development, cultural and linguistic context, real-world applications, and research efforts. We also examine emerging approaches, such as language-agnostic models and multi-modal techniques, while emphasizing the need for improved data collection practices, interdisciplinary collaboration, and stronger incentives for socially responsible AI research. Our findings underscore the need for robust, inclusive systems capable of addressing misinformation across diverse linguistic and cultural contexts.

large language model, machine learning, natural language, (18 more...)

2410.1839

Country:

Asia > India (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
South America > Brazil (0.04)
(14 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry: Media > News (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(5 more...)

Deoghare, Sourabh, Kanojia, Diptesh, Bhattacharyya, Pushpak

Together We Can: Multilingual Automatic Post-Editing for Low-Resource Languages

arXiv.org Artificial IntelligenceOct-23-2024

This exploratory study investigates the potential of multilingual Automatic Post-Editing (APE) systems to enhance the quality of machine translations for low-resource Indo-Aryan languages. Focusing on two closely related language pairs, English-Marathi and English-Hindi, we exploit the linguistic similarities to develop a robust multilingual APE model. To facilitate cross-linguistic transfer, we generate synthetic Hindi-Marathi and Marathi-Hindi APE triplets. Additionally, we incorporate a Quality Estimation (QE)-APE multi-task learning framework. While the experimental results underline the complementary nature of APE and QE, we also observe that QE-APE multitask learning facilitates effective domain adaptation. Our experiments demonstrate that the multilingual APE models outperform their corresponding English-Hindi and English-Marathi single-pair models by $2.5$ and $2.39$ TER points, respectively, with further notable improvements over the multilingual APE model observed through multi-task learning ($+1.29$ and $+1.44$ TER points), data augmentation ($+0.53$ and $+0.45$ TER points) and domain adaptation ($+0.35$ and $+0.45$ TER points). We release the synthetic data, code, and models accrued during this study publicly at https://github.com/cfiltnlp/Multilingual-APE.

artificial intelligence, machine learning, natural language, (16 more...)

2410.17973

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.05)
Asia > Singapore (0.04)
(15 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Chiaranaipanich, Jirat, Hanmatheekuna, Naiyarat, Sawatphol, Jitkapat, Tiankanon, Krittamate, Kinchagawat, Jiramet, Chinkamol, Amrest, Pengpun, Parinthapat, Ittichaiwong, Piyalitt, Limkonchotiwat, Peerat

Can General-Purpose Large Language Models Generalize to English-Thai Machine Translation ?

arXiv.org Artificial IntelligenceOct-22-2024

Large language models (LLMs) perform well on common tasks but struggle with generalization in low-resource and low-computation settings. We examine this limitation by testing various LLMs and specialized translation models on English-Thai machine translation and code-switching datasets. Our findings reveal that under more strict computational constraints, such as 4-bit quantization, LLMs fail to translate effectively. In contrast, specialized models, with comparable or lower computational requirements, consistently outperform LLMs. This underscores the importance of specialized models for maintaining performance under resource constraints.

large language model, machine learning, translation, (20 more...)

2410.17145

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.40)