AITopics | parallel corpus

Collaborating Authors

parallel corpus

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Accent Placement Models for Rigvedic Sanskrit Text

P, Akhil Rajeev, Kulkarni, Annarao

arXiv.org Artificial IntelligenceDec-1-2025

The Rigveda, among the oldest Indian texts in Vedic Sanskrit, employs a distinctive pitch-accent system : udātta, anudātta, svarita whose marks encode melodic and interpretive cues but are often absent from modern e-texts. This work develops a parallel corpus of accented-unaccented ślokas and conducts a controlled comparison of three strategies for automatic accent placement in Rigvedic verse: (i) full fine-tuning of ByT5, a byte-level Transformer that operates directly on Unicode combining marks, (ii) a from-scratch BiLSTM-CRF sequence-labeling baseline, and (iii) LoRA-based parameter-efficient fine-tuning atop ByT5. Evaluation uses Word Error Rate (WER) and Character Error Rate (CER) for orthographic fidelity, plus a task-specific Diacritic Error Rate (DER) that isolates accent edits. Full ByT5 fine-tuning attains the lowest error across all metrics; LoRA offers strong efficiency-accuracy trade-offs, and BiLSTM-CRF serves as a transparent baseline. The study underscores practical requirements for accent restoration - Unicode-safe preprocessing, mark-aware tokenization, and evaluation that separates grapheme from accent errors - and positions heritage-language technology as an emerging NLP area connecting computational modeling with philological and pedagogical aims. Results establish reproducible baselines for Rigvedic accent restoration and provide guidance for downstream tasks such as accent-aware OCR, ASR/chant synthesis, and digital scholarship.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.23088

Country:

Asia > Nepal (0.15)
North America > United States (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation

Hasan, Kazi Reyazul, Musarrat, Mubasshira, Islam, A. B. M. Alim Al, Adnan, Muhammad Abdullah

arXiv.org Artificial IntelligenceNov-6-2025

Large language models work well for technical problem solving in English but perform poorly when the same questions are asked in Bangla. A simple solution would be to translate Bangla questions into English first and then use these models. However, existing Bangla-English translation systems struggle with technical terms. They often mistranslate specialized vocabulary, which changes the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM fields including computer science, mathematics, physics, chemistry, and biology. We generated over 12,000 translations using language models and then used human evaluators to select the highest quality pairs that preserve technical terminology correctly. We train a T5-based translation model on BanglaSTEM and test it on two tasks: generating code and solving math problems. Our results show significant improvements in translation accuracy for technical content, making it easier for Bangla speakers to use English-focused language models effectively. Both the BanglaSTEM dataset and the trained translation model are publicly released at https://huggingface.co/reyazul/BanglaSTEM-T5.

large language model, machine learning, translation, (18 more...)

arXiv.org Artificial Intelligence

2511.03498

Country: Asia > Bangladesh (0.14)

Genre: Research Report > New Finding (0.68)

Industry:

Information Technology > Security & Privacy (0.68)
Health & Medicine (0.68)
Education > Curriculum > Subject-Specific Education (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification

Mohsin, Ayesha Afroza, Ahsan, Mashrur, Maliyat, Nafisa, Maria, Shanta, Raiyan, Syed Rifat, Mahmud, Hasan, Hasan, Md Kamrul

arXiv.org Artificial IntelligenceNov-4-2025

Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains under-explored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (Co T) prompting to generate detoxified sentences. To support this effort, we construct BANGLANIRTOX, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BANGLANIRTOX dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with Co T prompting significantly enhance the quality and consistency of Bengali text detoxification. Warning: This paper contains examples of toxic and offensive language.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2511.01512

Country: Asia > Bangladesh (0.14)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

Shen, Yingli, Lai, Wen, Wang, Shuo, Gao, Ge, Luo, Kangyang, Fraser, Alexander, Sun, Maosong

arXiv.org Artificial IntelligenceOct-22-2025

Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multiway parallel data consistently outperform those trained on unaligned multilingual data.

computational linguistic, large language model, natural language, (13 more...)

arXiv.org Artificial Intelligence

2505.14045

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Energy (0.92)
Education (0.68)
Government (0.67)
Health & Medicine > Therapeutic Area > Neurology (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

UPRPRC: Unified Pipeline for Reproducing Parallel Resources -- Corpus from the United Nations

Lu, Qiuyang, Shen, Fangjian, Tang, Zhengkai, Liu, Qiang, Cheng, Hexuan, Liu, Hui, Wen, Wushao

arXiv.org Artificial IntelligenceSep-22-2025

The quality and accessibility of multilingual datasets are crucial for advancing machine translation. However, previous corpora built from United Nations documents have suffered from issues such as opaque process, difficulty of reproduction, and limited scale. To address these challenges, we introduce a complete end-to-end solution, from data acquisition via web scraping to text alignment. The entire process is fully reproducible, with a minimalist single-machine example and optional distributed computing steps for scalability. At its core, we propose a new Graph-Aided Paragraph Alignment (GAPA) algorithm for efficient and flexible paragraph-level alignment. The resulting corpus contains over 713 million English tokens, more than doubling the scale of prior work. To the best of our knowledge, this represents the largest publicly available parallel corpus composed entirely of human-translated, non-AI-generated content. Our code and corpus are accessible under the MIT License.

artificial intelligence, natural language, paragraph, (16 more...)

arXiv.org Artificial Intelligence

2509.15789

Country: Europe (0.93)

Genre: Research Report (0.40)

Industry: Government > Intergovernmental Programs (0.63)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus

Nagata, Masaaki, Chousa, Katsuki, Yasuda, Norihito

arXiv.org Artificial IntelligenceAug-25-2025

We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the publication of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from the DOCDB, that is a bibliographic database maintained by the European Patent Office (EPO). We extracted approximately 1.4M Japanese-English document pairs, which are translations of each other based on the patent families, and extracted about 350M sentence pairs from the document pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment method. We experimentally improved the accuracy of the patent translations by 20 bleu points by adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web.

application, artificial intelligence, natural language, (14 more...)

arXiv.org Artificial Intelligence

2508.16303

Country:

Asia > Japan > Honshū (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.64)

Industry:

Law > Intellectual Property & Technology Law (1.00)
Government > Regional Government > North America Government > United States Government (0.56)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Pivot Language for Low-Resource Machine Translation

Talwar, Abhimanyu, Laasri, Julien

arXiv.org Artificial IntelligenceMay-22-2025

Certain pairs of languages suffer from lack of a parallel corpus which is large in size and diverse in domain. One of the ways this is overcome is via use of a pivot language. In this paper we use Hindi as a pivot language to translate Nepali into English. We describe what makes Hindi a good candidate for the pivot. We discuss ways in which a pivot language can be used, and use two such approaches - the Transfer Method (fully supervised) and Backtransla-tion (semi-supervised) - to translate Nepali into English. Using the former, we are able to achieve a devtest Set SacreBLEU score of 14.2, which improves the baseline fully supervised score reported by (Guzm an et al., 2019) by 6.6 points. While we are slightly below the semi-supervised baseline score of 15.1, we discuss what may have caused this under-performance, and suggest scope for future work.

artificial intelligence, machine translation, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.14553

Country:

Asia (0.46)
Europe (0.28)

Genre: Research Report (0.42)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair

Borisov, Maksim, Kozhirbayev, Zhanibek, Malykh, Valentin

arXiv.org Artificial IntelligenceMar-25-2025

Machine translation for low resource language pairs is a challenging task. This task could become extremely difficult once a speaker uses code switching. We propose a method to build a machine translation model for code-switched Kazakh-Russian language pair with no labeled data. Our method is basing on generation of synthetic data. Additionally, we present the first codeswitching Kazakh-Russian parallel corpus and the evaluation results, which include a model achieving 16.48 BLEU almost reaching an existing commercial system and beating it by human evaluation.

computational linguistic, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.20007

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Russia > Northwestern Federal District > Leningrad Oblast > Saint Petersburg (0.04)
(16 more...)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Reviews: Code Generation as a Dual Task of Code Summarization

Neural Information Processing SystemsJan-27-2025, 18:21:56 GMT

This paper presents an interesting approach of using the duality relationship between Code Summarization (CS) and Code Generation (CG) to improve the performance of a neural model on both tasks simultaneously. The main idea is to exploit the fact that the conditional probability of a comment given some source code, and the conditional probability of source code given a comment, are both related by their common joint probability. Moreover, since both the tasks of CS and CG use an attention-based seq2seq architecture, this paper also proposes to add an additional constraint that the two attention vectors have similar distributions, i.e. the attention weight of comment word i to source token j for the CS task is similar to the attention weights of the same pair for the CG task. The method is evaluated on two datasets of Java and Python programs/comment pairs and the dual training outperforms several baseline methods including the same architecture trained without dual constraints (basic model). Overall, I liked the idea of exploiting the dual relationship between the code summarization and code generation tasks. The proposed dual regularization terms relating to the factorization of conditional probability distributions and similarity of attention matrices are quite elegant.

code generation, code summarization, constraint, (15 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (0.83)

Add feedback

ERUPD -- English to Roman Urdu Parallel Dataset

Furqan, Mohammed, Khaja, Raahid Bin, Habeeb, Rayyan

arXiv.org Artificial IntelligenceDec-23-2024

Bridging linguistic gaps fosters global growth and cultural exchange. This study addresses the challenges of Roman Urdu -- a Latin-script adaptation of Urdu widely used in digital communication -- by creating a novel parallel dataset comprising 75,146 sentence pairs. Roman Urdu's lack of standardization, phonetic variability, and code-switching with English complicates language processing. We tackled this by employing a hybrid approach that combines synthetic data generated via advanced prompt engineering with real-world conversational data from personal messaging groups. We further refined the dataset through a human evaluation phase, addressing linguistic inconsistencies and ensuring accuracy in code-switching, phonetic representations, and synonym variability. The resulting dataset captures Roman Urdu's diverse linguistic features and serves as a critical resource for machine translation, sentiment analysis, and multilingual education.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2412.17562

Country:

Asia > India > Telangana > Hyderabad (0.05)
Europe > Czechia > Prague (0.04)
Asia > Middle East > Oman (0.04)
(4 more...)

Genre:

Research Report (0.65)
Overview (0.46)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback