AITopics | sinhala

Collaborating Authors

sinhala

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Translation Model

Neural Information Processing SystemsDec-25-2025, 15:17:55 GMT

Numerous recent work on unsupervised machine translation (UMT) implies that competent unsupervised translations of low-resource and unrelated languages, such as Nepali or Sinhala, are only possible if the model is trained in a massive multilingual environment, where these low-resource languages are mixed with high-resource counterparts. Nonetheless, while the high-resource languages greatly help kick-start the target low-resource translation tasks, the language discrepancy between them may hinder their further improvement. In this work, we propose a simple refinement procedure to separate languages from a pre-trained multilingual UMT model for it to focus on only the target low-resource task. Our method achieves the state of the art in the fully unsupervised translation tasks of English to Nepali, Sinhala, Gujarati, Latvian, Estonian and Kazakh, with BLEU score gains of 3.5, 3.5, 3.3, 4.1, 4.2, and 3.3, respectively. Our codebase is available at https://github.com/nxphi47/refine

language disentanglement, multilingual translation model, refining low-resource unsupervised translation, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.41)

Add feedback

SinLlama -- A Large Language Model for Sinhala

Aravinda, H. W. K., Sirajudeen, Rashad, Karunathilake, Samith, de Silva, Nisansa, Ranathunga, Surangika, Kaur, Rishemjit

arXiv.org Artificial IntelligenceNov-11-2025

Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM tokenizer with Sinhala specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed base and instruct variants of Llama-3-8B by a significant margin.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/MERCon67903.2025.11217094

2508.09115

Country: Asia > India (0.15)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Low-Resource Speech-Driven NLP Pipeline for Sinhala Dyslexia Assistance

Perera, Peshala, Sumanathilaka, Deshan

arXiv.org Artificial IntelligenceOct-7-2025

Dyslexia in adults remains an under-researched and under-served area, particularly in non-English-speaking contexts, despite its significant impact on personal and professional lives. This work addresses that gap by focusing on Sinhala, a low-resource language with limited tools for linguistic accessibility. We present an assistive system explicitly designed for Sinhala-speaking adults with dyslexia. The system integrates Whisper for speech-to-text conversion, SinBERT, an open-sourced fine-tuned BERT model trained for Sinhala to identify common dyslexic errors, and a combined mT5 and Mistral-based model to generate corrected text. Finally, the output is converted back to speech using gTTS, creating a complete multimodal feedback loop. Despite the challenges posed by limited Sinhala-language datasets, the system achieves 0.66 transcription accuracy and 0.7 correction accuracy with 0.65 overall system accuracy. These results demonstrate both the feasibility and effectiveness of the approach. Ultimately, this work highlights the importance of inclusive Natural Language Processing (NLP) technologies in underrepresented languages and showcases a practical

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.0475

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)

Add feedback

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

Shafique, Bhuiyan Sanjid, Vayani, Ashmal, Maaz, Muhammad, Rasheed, Hanoona Abdul, Dissanayake, Dinura, Kurpath, Mohammed Irfan, Hmaiti, Yahya, Inoue, Go, Lahoud, Jean, Rashid, Md. Safirur, Quasem, Shadid Intisar, Fatima, Maheen, Vidal, Franco, Maslych, Mykola, More, Ketan Pravin, Baliah, Sanoojan, Watawana, Hasindri, Li, Yuhao, Farestam, Fabian, Schaller, Leon, Tymtsiv, Roman, Weber, Simon, Cholakkal, Hisham, Laptev, Ivan, Satoh, Shin'ichi, Felsberg, Michael, Shah, Mubarak, Khan, Salman, Khan, Fahad Shahbaz

arXiv.org Artificial IntelligenceOct-1-2025

Large multimodal models (LMMs) have recently gained attention due to their effectiveness to understand and generate descriptions of visual content. Most existing LMMs are in English language. While few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to evaluate Video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we also introduce a machine translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high-and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM along with a large-scale multilingual video training set will help ease future research in developing cultural and linguistic inclusive multilingual video LMMs. Our proposed benchmark, video LMM and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.07032

Country:

Europe (1.00)
Asia (1.00)

Genre: Research Report > New Finding (0.68)

Industry:

Media (0.93)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala

Pramodya, Ashmari, Nelki, Nirasha, Shalinda, Heshan, Liyanage, Chamila, Sakai, Yusuke, Pushpananda, Randil, Weerasinghe, Ruvan, Kamigaito, Hidetaka, Watanabe, Taro

arXiv.org Artificial IntelligenceSep-4-2025

Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.

computational linguistic, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2509.03162

Country:

Europe (1.00)
Asia (1.00)
North America > United States > New Mexico (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Education > Educational Setting (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil

Jayatilleke, Nevidu, de Silva, Nisansa

arXiv.org Artificial IntelligenceAug-26-2025

Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.18264

Country:

North America > United States (1.00)
Asia (1.00)

Genre: Research Report (0.64)

Industry: Information Technology > Services (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources

Sumanathilaka, Deshan, Perera, Sameera, Dharmasiri, Sachithya, Athukorala, Maneesha, Herath, Anuja Dilrukshi, Dias, Rukshan, Gamage, Pasindu, Weerasinghe, Ruvan, Priyadarshana, Y. H. P. P.

arXiv.org Artificial IntelligenceJul-15-2025

The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms developed for Romanized Sinhala to Sinhala transliteration between 2020 and 2025. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The current openly accessible data sets and corresponding tools are made publicly available through this hub. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.

machine learning, natural language, transliteration, (19 more...)

arXiv.org Artificial Intelligence

2507.09245

Country:

Asia (0.69)
Europe > United Kingdom (0.28)

Genre:

Research Report (0.64)
Overview (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.53)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Enhancing Multilingual Sentiment Analysis with Explainability for Sinhala, English, and Code-Mixed Content

Rizvi, Azmarah, Thamindu, Navojith, Adhikari, A. M. N. H., Senevirathna, W. P. U., Kasthurirathna, Dharshana, Abeywardhana, Lakmini

arXiv.org Artificial IntelligenceApr-21-2025

Sentiment analysis is crucial for brand reputation management in the banking sector, where customer feedback spans English, Sinhala, Singlish, and code-mixed text. Existing models struggle with low-resource languages like Sinhala and lack interpretability for practical use. This research develops a hybrid aspect-based sentiment analysis framework that enhances multilingual capabilities with explainable outputs. Using cleaned banking customer reviews, we fine-tune XLM-RoBERTa for Sinhala and code-mixed text, integrate domain-specific lexicon correction, and employ BERT-base-uncased for English. The system classifies sentiment (positive, neutral, negative) with confidence scores, while SHAP and LIME improve interpretability by providing real-time sentiment explanations. Experimental results show that our approaches outperform traditional transformer-based classifiers, achieving 92.3 percent accuracy and an F1-score of 0.89 in English and 88.4 percent in Sinhala and code-mixed content. An explainability analysis reveals key sentiment drivers, improving trust and transparency. A user-friendly interface delivers aspect-wise sentiment insights, ensuring accessibility for businesses. This research contributes to robust, transparent sentiment analysis for financial applications by bridging gaps in multilingual, low-resource NLP and explainability.

machine learning, natural language, sinhala, (15 more...)

arXiv.org Artificial Intelligence

2504.13545

Country: Asia > Sri Lanka (0.18)

Genre: Research Report > New Finding (0.66)

Industry:

Banking & Finance (1.00)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Keyword Extraction, and Aspect Classification in Sinhala, English, and Code-Mixed Content

Rizvi, F. A., Navojith, T., Adhikari, A. M. N. H., Senevirathna, W. P. U., Kasthurirathna, Dharshana, Abeywardhana, Lakmini

arXiv.org Artificial IntelligenceApr-16-2025

Brand reputation in the banking sector is maintained through insightful analysis of customer opinion on code-mixed and multilingual content. Conventional NLP models misclassify or ignore code-mixed text, when mix with low resource languages such as Sinhala-English and fail to capture domain-specific knowledge. This study introduces a hybrid NLP method to improve keyword extraction, content filtering, and aspect-based classification of banking content. Keyword extraction in English is performed with a hybrid approach comprising a fine-tuned SpaCy NER model, FinBERT-based KeyBERT embeddings, YAKE, and EmbedRank, which results in a combined accuracy of 91.2%. Code-mixed and Sinhala keywords are extracted using a fine-tuned XLM-RoBERTa model integrated with a domain-specific Sinhala financial vocabulary, and it results in an accuracy of 87.4%. To ensure data quality, irrelevant comment filtering was performed using several models, with the BERT-base-uncased model achieving 85.2% for English and XLM-RoBERTa 88.1% for Sinhala, which was better than GPT-4o, SVM, and keyword-based filtering. Aspect classification followed the same pattern, with the BERT-base-uncased model achieving 87.4% for English and XLM-RoBERTa 85.9% for Sinhala, both exceeding GPT-4 and keyword-based approaches. These findings confirm that fine-tuned transformer models outperform traditional methods in multilingual financial text analysis. The present framework offers an accurate and scalable solution for brand reputation monitoring in code-mixed and low-resource banking environments.

information retrieval, large language model, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2504.10679

Country: Asia > Sri Lanka (0.17)

Genre: Research Report (0.70)

Industry: Banking & Finance (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Framework to Assess Multilingual Vulnerabilities of LLMs

Tang, Likai, Bogahawatta, Niruth, Ginige, Yasod, Xu, Jiarui, Sun, Shixuan, Ranathunga, Surangika, Seneviratne, Suranga

arXiv.org Artificial IntelligenceMar-17-2025

Large Language Models (LLMs) are acquiring a wider range of capabilities, including understanding and responding in multiple languages. While they undergo safety training to prevent them from answering illegal questions, imbalances in training data and human evaluation resources can make these models more susceptible to attacks in low-resource languages (LRL). This paper proposes a framework to automatically assess the multilingual vulnerabilities of commonly used LLMs. Using our framework, we evaluated six LLMs across eight languages representing varying levels of resource availability. We validated the assessments generated by our automated framework through human evaluation in two languages, demonstrating that the framework's results align with human judgments in most cases. Our findings reveal vulnerabilities in LRL; however, these may pose minimal risk as they often stem from the model's poor performance, resulting in incoherent responses.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2503.13081

Country:

Oceania > Australia > New South Wales > Sydney (0.06)
North America > United States > District of Columbia > Washington (0.05)
Oceania > New Zealand (0.04)
Asia > Indonesia > Bali (0.04)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.78)

Add feedback