
Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti

Oni, Mangsura Kabir, Prama, Tabia Tanzin

arXiv.org Artificial Intelligence

Although the findings highlight the effectiveness of fine-tuned transformer models for Bengali-Sylheti translation, several limitations remain. The dataset size (5,002 parallel sentences) restricts the models' capacity to generalize across diverse syntactic structures, stylistic variations, and domain-specific expressions. In addition, orthographic inconsistencies in Sylheti introduce noise, leading to training instability, particularly in models like mBART-50. Another limitation is the reliance on automatic evaluation metrics such as BLEU and chrF, which may not fully capture the linguistic richness or cultural nuance of Sylheti. Future research should therefore focus on expanding the dataset through community-driven contributions and data augmentation strategies. Incorporating orthographic normalization could improve consistency and reduce variability during training. Hybrid approaches that combine the strengths of pre-trained LLMs with fine-tuned NMT models may also enhance translation robustness in low-resource settings. Finally, incorporating human evaluation will provide a more comprehensive assessment of translation adequacy, fluency, and cultural alignment.
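As a concrete illustration of the automatic metrics the study relies on, the sketch below scores a toy batch of hypotheses with sacrebleu; the romanized sentence pairs are invented placeholders, not items from the paper's corpus.

```python
import sacrebleu

# Toy hypotheses/references (invented placeholders, not from the dataset).
hypotheses = ["ami bala asi", "tumi kita koro"]
references = [["ami bhala asi", "tumi kita koro"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}  chrF = {chrf.score:.1f}")
```

chrF's character-level matching is more forgiving of the orthographic variation the abstract describes, which is one reason it is commonly reported alongside BLEU for languages without settled spelling conventions.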



Two Spelling Normalization Approaches Based on Large Language Models

Domingo, Miguel, Casacuberta, Francisco

arXiv.org Artificial Intelligence

The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization endeavors to align a document's orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one trained without supervision, and a second trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to conclude that, while both approaches yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.
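Framed as translation from historical to modern spelling, normalization reduces to ordinary sequence-to-sequence inference. The sketch below illustrates that framing with the Hugging Face pipeline API; the checkpoint path and the example sentence are hypothetical placeholders, not the models or data from the paper.

```python
from transformers import pipeline

# Hypothetical historical->modern spelling model (placeholder path).
normalizer = pipeline("text2text-generation", model="path/to/spelling-normalizer")

print(normalizer("Vppon the seuenth daye")[0]["generated_text"])
# Expected style of output: "Upon the seventh day"
```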


Multilingual Open QA on the MIA Shared Task

Yarrabelly, Navya, Mittal, Saloni, Todi, Ketan, Hasegawa, Kimihiro

arXiv.org Artificial Intelligence

Cross-lingual information retrieval (CLIR) [shi2021cross; asai2021one; jiang2020cross], for example, can find relevant text in any language, such as English (high-resource) or Telugu (low-resource), even when the query is posed in a different, possibly low-resource, language. In this work, we aim to develop useful CLIR models for this constrained yet important setting, in which no additional supervision or labelled data is required for the retrieval task, so the models can work effectively for low-resource languages. We propose a simple and effective re-ranking method for improving passage retrieval in open question answering. The re-ranker re-scores retrieved passages with a zero-shot multilingual question generation model, a pre-trained language model, computing the probability of the input question in the target language conditioned on a retrieved passage, which may be in a different language. We evaluate our method in a completely zero-shot setting that requires no training. The main advantage of our approach is therefore that it can be used to re-rank results obtained by any sparse retrieval method, such as BM25. This eliminates the need for the expensive labelled corpora required for retrieval tasks, making the method usable for low-resource languages.
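The scoring rule can be implemented in a few lines: rank each passage by the log-probability a multilingual seq2seq model assigns to the question given the passage. The sketch below uses google/mt5-base purely as a stand-in for the question generation model, which the abstract does not name.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Stand-in checkpoint; the paper's actual QG scorer may differ.
tok = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base").eval()

def question_logprob(question: str, passage: str) -> float:
    """Total log P(question | passage) under the seq2seq model."""
    enc = tok(passage, return_tensors="pt", truncation=True)
    labels = tok(question, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        loss = model(**enc, labels=labels).loss  # mean per-token NLL
    return -loss.item() * labels.size(1)

def rerank(question: str, passages: list[str]) -> list[str]:
    """Re-order BM25 (or any sparse) candidates by QG probability."""
    return sorted(passages, key=lambda p: question_logprob(question, p),
                  reverse=True)
```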


RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages

Kashid, Harshvivek, Bhattacharyya, Pushpak

arXiv.org Artificial Intelligence

Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Just like machine translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose RoundTripOCR, a synthetic data generation approach for Devanagari languages that tackles the scarcity of post-OCR error correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction that leverages techniques from machine translation: treating OCR errors as mistranslations in a parallel text corpus, we employ pre-trained transformer models to learn the mapping from erroneous to correct text pairs, effectively correcting OCR errors.
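The "OCR errors as mistranslations" idea maps directly onto ordinary seq2seq fine-tuning on (erroneous, correct) pairs. Below is a minimal sketch with the Hugging Face Trainer; the backbone choice (mt5-small) and the two-line toy corpus are assumptions for illustration, not the paper's setup.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Toy parallel "corpus": OCR output as source, ground truth as target.
pairs = [{"src": "<noisy OCR line 1>", "tgt": "<clean line 1>"},
         {"src": "<noisy OCR line 2>", "tgt": "<clean line 2>"}]
ds = Dataset.from_list(pairs)

tok = AutoTokenizer.from_pretrained("google/mt5-small")  # assumed backbone

def encode(ex):
    enc = tok(ex["src"], truncation=True, max_length=128)
    enc["labels"] = tok(text_target=ex["tgt"], truncation=True,
                        max_length=128)["input_ids"]
    return enc

ds = ds.map(encode, remove_columns=["src", "tgt"])
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="post-ocr-fix", num_train_epochs=3),
    train_dataset=ds,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()  # learns the erroneous -> correct mapping
```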


From Priest to Doctor: Domain Adaptation for Low-Resource Neural Machine Translation

Marashian, Ali, Rice, Enora, Gessler, Luke, Palmer, Alexis, von der Wense, Katharina

arXiv.org Artificial Intelligence

Many of the world's languages have insufficient data to train high-performing general neural machine translation (NMT) models, let alone domain-specific models, and often the only available parallel data are small amounts of religious texts. Hence, domain adaptation (DA) is a crucial issue faced by contemporary NMT and has, so far, been underexplored for low-resource languages. In this paper, we evaluate a set of methods from both low-resource NMT and DA in a realistic setting, in which we aim to translate between a high-resource and a low-resource language with access to only: a) parallel Bible data, b) a bilingual dictionary, and c) a monolingual target-domain corpus in the high-resource language. Our results show that the effectiveness of the tested methods varies, with the simplest one, DALI, being most effective. We follow up with a small human evaluation of DALI, which shows that there is still a need for more careful investigation of how to accomplish DA for low-resource NMT.
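The abstract does not spell out DALI's mechanics, but its core idea, as described in the domain adaptation literature, is dictionary-based pseudo-parallel data induction: word-translate in-domain sentences from the high-resource language with the bilingual dictionary to fabricate synthetic source sides. A minimal sketch of that idea, with an invented two-entry dictionary:

```python
# Invented toy dictionary: high-resource-language word -> low-resource word.
bilingual_dict = {"doctor": "lrl_doctor", "medicine": "lrl_medicine",
                  "the": "lrl_the"}

def word_translate(sentence: str) -> str:
    """Word-for-word 'translation'; out-of-dictionary words pass through."""
    return " ".join(bilingual_dict.get(w.lower(), w) for w in sentence.split())

# Monolingual in-domain (e.g. medical) sentences in the high-resource language.
mono_hrl = ["The doctor prescribed medicine"]

# Pseudo-parallel pairs: (synthetic low-resource source, real HRL target).
pseudo_parallel = [(word_translate(s), s) for s in mono_hrl]
print(pseudo_parallel)
```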


Segment-Based Interactive Machine Translation for Pre-trained Models

Navarro, Angel, Casacuberta, Francisco

arXiv.org Artificial Intelligence

Pre-trained large language models (LLMs) are starting to be widely used in many applications. In this work, we explore the use of these models in interactive machine translation (IMT) environments. In particular, we have chosen mBART (multilingual Bidirectional and Auto-Regressive Transformer) and mT5 (multilingual Text-to-Text Transfer Transformer) as the LLMs for our experiments. The system generates perfect translations interactively using the feedback provided by the user at each iteration: the neural machine translation (NMT) model generates a preliminary hypothesis incorporating the feedback, and the user validates new correct segments and performs a word correction, repeating the process until the sentence is correctly translated. We compared the performance of mBART, mT5, and a state-of-the-art (SoTA) machine translation model on a benchmark dataset in terms of user effort: Word Stroke Ratio (WSR), Key Stroke Ratio (KSR), and Mouse Action Ratio (MAR). The experimental results indicate that mBART performed comparably to SoTA models, suggesting that it is a viable option for IMT. This finding has implications for the development of new machine translation models for interactive environments, as it indicates that some novel pre-trained models exhibit SoTA performance in this domain, highlighting the potential benefits of adapting them to specific needs.
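One building block of such a system is prefix-constrained decoding: force the decoder to continue from the segment the user has already validated. The sketch below shows this for mBART-50 via a forced decoder prefix; the sentence, language pair, and validated prefix are illustrative, and the authors' segment-based protocol is richer than a pure prefix constraint.

```python
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

name = "facebook/mbart-large-50-many-to-many-mmt"
tok = MBart50TokenizerFast.from_pretrained(name)
model = MBartForConditionalGeneration.from_pretrained(name).eval()
tok.src_lang, tok.tgt_lang = "en_XX", "es_XX"

src = tok("How are you today?", return_tensors="pt")

# User-validated segment from a previous IMT iteration: the next hypothesis
# must extend this prefix. (Illustrative example.)
prefix_ids = tok(text_target="¿Cómo", return_tensors="pt").input_ids[:, :-1]
start = torch.tensor([[tok.eos_token_id]])  # mBART decoding starts with </s>
decoder_input_ids = torch.cat([start, prefix_ids], dim=1)  # [</s>, es_XX, ...]

with torch.no_grad():
    out = model.generate(**src, decoder_input_ids=decoder_input_ids,
                         max_new_tokens=30)
print(tok.decode(out[0], skip_special_tokens=True))
```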


CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models Using Synthetic Back-Translation Data

Hong, Kung Yin, Han, Lifeng, Batista-Navarro, Riza, Nenadic, Goran

arXiv.org Artificial Intelligence

Neural Machine Translation (NMT) for low-resource languages remains a challenging task for NLP researchers. In this work, we apply a standard back-translation data augmentation methodology to a new translation direction, Cantonese-to-English. We present the models we fine-tuned, including OpusMT, NLLB, and mBART, using the limited amount of real data and the synthetic data we generated via back-translation. We carried out automatic evaluation using a range of different metrics, both lexical-based and embedding-based. Furthermore, we created a user-friendly interface for the models included in this CantonMT research project and make it available to facilitate Cantonese-to-English MT research. Researchers can add more models to this platform via our open-source CantonMT toolkit: https://github.com/kenrickkung/CantoneseTranslation.
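Back-translation itself is a short loop: run monolingual English through a reverse (English-to-Cantonese) model and pair the synthetic Cantonese with the original English as extra training data. A minimal sketch, where the model path is a placeholder rather than one of the project's released checkpoints:

```python
from transformers import pipeline

# Placeholder checkpoint; the project fine-tunes OpusMT/NLLB/mBART variants.
en2yue = pipeline("translation", model="path/to/en-yue-model")

mono_en = ["I had dim sum for breakfast this morning."]

# (synthetic Cantonese source, real English target) pairs augment
# the Cantonese-to-English training set.
synthetic_pairs = [(en2yue(s)[0]["translation_text"], s) for s in mono_en]
```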


Key ingredients for effective zero-shot cross-lingual knowledge transfer in generative tasks

Chirkova, Nadezhda, Nikoulina, Vassilina

arXiv.org Artificial Intelligence

Zero-shot cross-lingual knowledge transfer enables a multilingual pretrained language model, finetuned on a task in one language, to make predictions for this task in other languages. While broadly studied for natural language understanding tasks, this setting is understudied for generation. Previous works have noted the frequent problem of generation in the wrong language and proposed approaches to address it, usually using mT5 as a backbone model. In this work, we compare various approaches from the literature in unified settings, also including alternative backbone models, namely mBART and NLLB-200. We first underline the importance of tuning the learning rate used for finetuning, which helps to substantially alleviate the problem of generation in the wrong language. Then, we show that with careful learning rate tuning, simple full finetuning of the model acts as a very strong baseline, and alternative approaches bring only marginal improvements. Finally, we find that mBART performs similarly to mT5 of the same size, and that NLLB-200 can be competitive in some cases. Our final zero-shot models reach the performance of the approach based on data translation, which is usually considered an upper baseline for zero-shot cross-lingual transfer in generation.
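The paper's most practical takeaway, that the finetuning learning rate alone can largely suppress wrong-language generation, amounts to a small sweep over a single hyperparameter. A sketch of such a sweep follows; the grid values and config are illustrative, not the paper's exact setup.

```python
from transformers import Seq2SeqTrainingArguments

def sweep_configs(lrs=(1e-5, 3e-5, 1e-4)):
    """One training config per candidate learning rate (illustrative grid)."""
    return [
        Seq2SeqTrainingArguments(
            output_dir=f"xlt-lr-{lr:g}",
            learning_rate=lr,
            num_train_epochs=3,
            predict_with_generate=True,  # needed to inspect output language
        )
        for lr in lrs
    ]

# Train one model per config and keep the checkpoint whose generations
# stay in the target language with the best validation score.
configs = sweep_configs()
```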