AITopics

2412.09993

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
North America > Canada > Ontario > Toronto (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(5 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Joglekar, Advait, Umesh, Srinivasan

Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages

arXiv.org Artificial Intelligence, Dec-12-2024

Neural Machine Translation (NMT) models are typically trained on datasets with limited exposure to Scientific, Technical and Educational domains. Translation models thus, in general, struggle with tasks that involve scientific understanding or technical jargon. Their performance is found to be even worse for low-resource Indian languages. Finding a translation dataset that caters to these domains in particular is difficult. In this paper, we address this by creating a multilingual parallel corpus containing more than 2.8 million rows of English-to-Indic and Indic-to-Indic high-quality translation pairs across 8 Indian languages. We achieve this by bitext mining human-translated transcriptions of NPTEL video lectures. We also finetune and evaluate NMT models using this corpus and surpass all other publicly available models at in-domain tasks. We also demonstrate the potential for generalizing to out-of-domain translation tasks by improving the baseline by over 2 BLEU on average for these Indian languages on the Flores+ benchmark. We are pleased to release our model and dataset via this link: https://huggingface.co/SPRINGLab.

artificial intelligence, machine translation, natural language, (18 more...)

2412.09025

Country:

Asia > India (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(3 more...)

Genre: Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial Intelligence, Dec-12-2024

PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model

Lauc, Davor

This paper presents PolyIPA, a novel multilingual phoneme-to-grapheme conversion model designed for multilingual name transliteration, onomastic research, and information retrieval. The model leverages two helper models developed for data augmentation: IPA2vec for finding soundalikes across languages, and similarIPA for handling phonetic notation variations. Evaluated on a test set that spans multiple languages and writing systems, the model achieves a mean Character Error Rate of 0.055 and a character-level BLEU score of 0.914, with particularly strong performance on languages with shallow orthographies. The implementation of beam search further improves practical utility, with top-3 candidates reducing the effective error rate by 52.7% (to CER: 0.026), demonstrating the model's effectiveness for cross-linguistic applications.
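To make the metric in the abstract above concrete, here is a minimal sketch of Character Error Rate computed from Levenshtein distance, plus a "best of top-k" variant. The function names are hypothetical, and reading the "effective error rate" for top-3 candidates as the lowest CER among the candidates is an assumption, not something the abstract states. The last lines only check the quoted arithmetic.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def best_of_k_cer(ref: str, candidates: list[str]) -> float:
    """Lowest CER among beam candidates (assumed meaning of 'effective' CER)."""
    return min(cer(ref, c) for c in candidates)


# Arithmetic check on the quoted figures: a 52.7% reduction from 0.055.
reduced = 0.055 * (1 - 0.527)
print(round(reduced, 3))  # 0.026
```

Under this reading, a test item counts as fully correct as soon as any of the k candidates matches the reference exactly.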

artificial intelligence, machine learning, natural language, (21 more...)

2412.09102

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Croatia > Zagreb County > Zagreb (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

Lai, Huiyuan, Ploeger, Esther, van Noord, Rik, Toral, Antonio

Multi-perspective Alignment for Increasing Naturalness in Neural Machine Translation

arXiv.org Artificial Intelligence, Dec-11-2024

Neural machine translation (NMT) systems amplify lexical biases present in their training data, leading to artificially impoverished language in output translations. These language-level characteristics render automatic translations different from both text originally written in a language and human translations, which hinders their usefulness for tasks such as creating evaluation datasets. Attempts to increase naturalness in NMT can fall short in terms of content preservation, where increased lexical diversity comes at the cost of translation accuracy. Inspired by the reinforcement learning from human feedback framework, we introduce a novel method that rewards both naturalness and content preservation. We experiment with multiple perspectives to produce more natural translations, aiming at reducing machine and human translationese. We evaluate our method on English-to-Dutch literary translation, and find that our best model produces translations that are lexically richer and exhibit more properties of human-written language, without loss in translation accuracy.

machine learning, natural language, translation, (18 more...)

2412.08473

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
(20 more...)

Genre: Research Report (0.84)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

arXiv.org Artificial Intelligence, Dec-11-2024

From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set

Finkelstein, Mara, Deutsch, Dan, Riley, Parker, Juraska, Juraj, Kovacs, Geza, Freitag, Markus

As LLMs continue to become more powerful and versatile, human evaluation has quickly become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure certain capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method for designing automatic metrics across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.
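The idea of building in-context examples from historical ratings on a fixed test set can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Rating` record, the prompt wording, and the choice to select past ratings that share the same source segment are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    source: str       # test-set source segment
    translation: str  # a previously rated system output for that segment
    errors: str       # historical error annotation (free text in this sketch)


def build_prompt(source: str, candidate: str,
                 history: list[Rating], k: int = 2) -> str:
    """Assemble a prompt whose ICL examples are past ratings of the same source."""
    # Assumption: only ratings for the identical source segment are used.
    examples = [r for r in history if r.source == source][:k]
    parts = ["Annotate the translation errors."]
    for r in examples:
        parts.append(
            f"Source: {r.source}\nTranslation: {r.translation}\nErrors: {r.errors}"
        )
    # The new translation to be rated goes last, with the answer left open.
    parts.append(f"Source: {source}\nTranslation: {candidate}\nErrors:")
    return "\n\n".join(parts)
```

Because the test set is fixed, every new system output arrives with ready-made examples for its own source segment, which is what distinguishes this setup from a general-purpose prompted metric.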

icl example, translation, wmt, (14 more...)

2411.15387

Country: North America > Mexico > Mexico City > Mexico City (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Mohamed, Naira Abdou, Erraji, Zakarya, Bahafid, Abdessalam, Benelallam, Imade

Harnessing Transfer Learning from Swahili: Advancing Solutions for Comorian Dialects

arXiv.org Artificial Intelligence, Dec-9-2024

While some African languages, such as Swahili, now have enough resources to develop high-performing Natural Language Processing (NLP) systems, many other languages spoken on the continent still lack such support. For these languages, whose NLP support is still in its infancy, several possibilities exist to address this critical lack of data. Among them is Transfer Learning, which allows low-resource languages to benefit from the good representation of other languages that are similar to them. In this work, we adopt a similar approach, aiming to pioneer NLP technologies for Comorian, a group of four languages or dialects belonging to the Bantu family. Our approach is initially motivated by the hypothesis that if a human can understand a different language from their native language with little or no effort, it would be entirely possible to model this process on a machine. To achieve this, we consider ways to construct Comorian datasets mixed with Swahili. Note that, for the Swahili data, we focus only on elements that are closest to Comorian, which we identify by calculating lexical distances between candidate and source data. We empirically test this hypothesis in two use cases: Automatic Speech Recognition (ASR) and Machine Translation (MT). Our MT model achieved ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.6826, 0.42, and 0.6532, respectively, while our ASR system recorded a WER of 39.50% and a CER of 13.76%. This research is crucial for advancing NLP in underrepresented languages, with potential to preserve and promote Comorian linguistic heritage in the digital age.
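Selecting the candidate data closest to a target language by lexical distance can be sketched as below. The abstract does not say which distance is used, so the similarity ratio from Python's `difflib.SequenceMatcher`, the threshold value, and the function names are all assumptions, and the example strings in the usage note are placeholders rather than real Swahili or Comorian text.

```python
from difflib import SequenceMatcher


def lexical_distance(a: str, b: str) -> float:
    """Distance in [0, 1]: 0 for identical strings, 1 for nothing in common."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def select_closest(candidates: list[str], targets: list[str],
                   threshold: float = 0.5) -> list[str]:
    """Keep candidates whose nearest target is within the distance threshold."""
    kept = []
    for cand in candidates:
        nearest = min(lexical_distance(cand, t) for t in targets)
        if nearest <= threshold:
            kept.append(cand)
    return kept
```

For instance, `select_closest(["abcdf", "zzzzz"], ["abcde"])` keeps only `"abcdf"`, since it shares four of five characters with the target while `"zzzzz"` shares none.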

machine learning, natural language, swahili, (16 more...)

2412.12143

Country:

Africa > Comoros (0.05)
Indian Ocean (0.05)
Africa > Middle East > Morocco > Rabat-Salé-Kénitra Region > Rabat (0.05)
(5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research

Zhong, Tianyang, Yang, Zhenyuan, Liu, Zhengliang, Zhang, Ruidong, Liu, Yiheng, Sun, Haiyang, Pan, Yi, Li, Yiwei, Zhou, Yifan, Jiang, Hanqi, Chen, Junhao, Liu, Tianming

Importance and Endangerment of Low-Resource Languages in the Global Linguistic Ecology

The linguistic landscape of the world constitutes a complex tapestry interwoven with a rich diversity of languages, each strand epitomizing a distinctive cultural, historical, and social identity. This global linguistic diversity forms a foundational pillar of human civilization, cultivating an array of perspectives and worldviews that enhance our collective intellectual legacy. Among these, low-resource languages occupy a particularly crucial position, not merely as modes of communication but as repositories of distinctive cultural knowledge, historical narratives, and worldviews. These languages, frequently spoken by smaller communities, are essential to the preservation of cultural heritage and the transmission of indigenous knowledge systems. However, the global linguistic landscape is presently undergoing an extraordinary crisis, with low-resource languages among the most threatened. The swift vanishing of these languages is of serious concern, as data and studies underscore. It is estimated, for example, that around 40% of the world's 7,000 languages face extinction, with numerous low-resource languages having fewer than 1,000 speakers [94].

large language model, machine learning, natural language, (18 more...)

2412.04497

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Oceania > Australia (0.04)
(14 more...)

Genre:

Overview (1.00)
Instructional Material (0.93)
Research Report > Promising Solution (0.67)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Chang, Ke-Ching, Chen, Chung-Chi, Yen, An-Zi

Paraphrase-Aligned Machine Translation

Large Language Models (LLMs) have demonstrated significant capabilities in machine translation. However, their translation quality is sometimes questioned, as the generated outputs may deviate from expressions typically used by native speakers. These deviations often arise from differences in sentence structure between language systems. To address this issue, we propose ParaAlign Translator, a method that fine-tunes LLMs to paraphrase sentences, aligning their structures with those of the target language systems. This approach improves the performance of subsequent translations. Experimental results demonstrate that the proposed method enhances the LLaMA-3-8B model's performance in both resource-rich and low-resource scenarios and matches or surpasses the much larger LLaMA-3-70B model.

large language model, machine learning, natural language, (15 more...)

2412.05916

Country:

Asia > Singapore (0.05)
North America > Mexico > Mexico City > Mexico City (0.04)
North America > Canada > Ontario > Toronto (0.04)
(5 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Integrative Decoding: Improve Factuality via Implicit Self-consistency

Cheng, Yi, Liang, Xiao, Gong, Yeyun, Xiao, Wen, Wang, Song, Zhang, Yuji, Hou, Wenjun, Xu, Kaishuai, Liu, Wenge, Li, Wenjie, Jiao, Jian, Chen, Qi, Cheng, Peng, Xiong, Wayne

Self-consistency-based approaches, which involve repeatedly sampling multiple outputs and selecting the most consistent one as the final response, prove to be remarkably effective in improving the factual accuracy of large language models. Nonetheless, existing methods usually have strict constraints on the task format, largely limiting their applicability. In this paper, we present Integrative Decoding (ID), to unlock the potential of self-consistency in open-ended generation tasks. ID operates by constructing a set of inputs, each prepended with a previously sampled response, and then processes them concurrently, with the next token being selected by aggregating all their corresponding predictions at each decoding step. In essence, this simple approach implicitly incorporates self-consistency in the decoding objective. Extensive evaluation shows that ID consistently enhances factuality over a wide range of language models, with substantial improvements on the TruthfulQA (+11.2%), Biographies (+15.4%) and LongFact (+8.5%) benchmarks. The performance gains amplify progressively as the number of sampled responses increases, indicating the potential of ID to scale up with repeated sampling.
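The per-step aggregation described in the abstract above can be illustrated with a toy sketch. It assumes that each input (the prompt prepended with one previously sampled response) yields a dictionary of next-token log-probabilities, and that aggregation means summing those log-probabilities across inputs. The abstract does not specify the aggregation rule, so the sum, the floor value, and the function name are hypothetical.

```python
import math


def integrative_step(logprobs_per_input: list[dict[str, float]]) -> str:
    """Choose the next token by aggregating predictions from all inputs."""
    if not logprobs_per_input:
        raise ValueError("need at least one input's predictions")
    # Candidate tokens are those scored by any of the inputs.
    vocab = set().union(*logprobs_per_input)
    floor = math.log(1e-9)  # assumed score for tokens an input did not list
    totals = {
        tok: sum(lp.get(tok, floor) for lp in logprobs_per_input)
        for tok in vocab
    }
    # The token with the highest summed log-probability is emitted.
    return max(totals, key=totals.get)


# Toy predictions from three inputs over two placeholder tokens.
preds = [
    {"A": math.log(0.9), "B": math.log(0.1)},
    {"A": math.log(0.4), "B": math.log(0.6)},
    {"A": math.log(0.45), "B": math.log(0.55)},
]
print(integrative_step(preds))  # A
```

In this toy case a per-input majority vote would pick "B" (two of three inputs prefer it), while the summed log-probabilities pick "A", because one input is strongly confident in "A" and the others are close to undecided.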

large language model, machine learning, natural language, (21 more...)

2410.01556

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > United Kingdom > England > Greater London > London > Wimbledon (0.05)
Europe > Russia (0.04)
(13 more...)

Genre:

Personal (1.00)
Research Report > New Finding (0.45)

Industry:

Leisure & Entertainment > Sports > Tennis (1.00)
Information Technology (1.00)
Health & Medicine (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.92)

Zhang, Boyu, Le, Triet H. M., Babar, M. Ali

MVD: A Multi-Lingual Software Vulnerability Detection Framework

Software vulnerabilities can result in catastrophic cyberattacks that increasingly threaten business operations. Consequently, ensuring the safety of software systems has become a paramount concern for both private and public sectors. Recent literature has witnessed increasing exploration of learning-based approaches for software vulnerability detection. However, a key limitation of these techniques is their primary focus on a single programming language, such as C/C++, which poses constraints considering the polyglot nature of modern software projects. Further, there appears to be an oversight in harnessing the synergies of vulnerability knowledge across varied languages, potentially underutilizing the full capabilities of these methods. To address the aforementioned issues, we introduce MVD - an innovative multi-lingual vulnerability detection framework. This framework acquires the ability to detect vulnerabilities across multiple languages by concurrently learning from vulnerability data of various languages, which are curated by our specialized pipeline. We also incorporate incremental learning to enable the detection capability of MVD to be extended to new languages, thus augmenting its practical utility. Extensive experiments on our curated dataset of more than 11K real-world multi-lingual vulnerabilities substantiate that our framework significantly surpasses state-of-the-art methods in multi-lingual vulnerability detection by 83.7% to 193.6% in PR-AUC. The results also demonstrate that MVD detects vulnerabilities well for new languages without compromising the detection performance of previously trained languages, even when training data for the older languages is unavailable. Overall, our findings motivate and pave the way for the prediction of multi-lingual vulnerabilities in modern software systems.

machine learning, natural language, programming language, (18 more...)

2412.06166

Country:

Oceania > Australia > South Australia > Adelaide (0.04)
Europe > Germany > Berlin (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)