Tamil


IndicGEC: Powerful Models, or a Measurement Mirage?

Vajjala, Sowmya

arXiv.org Artificial Intelligence

In this paper, we report the results of TeamNRC's participation in the BHASHA-Task 1 Grammatical Error Correction shared task (https://github.com/BHASHA-Workshop/IndicGEC2025/) for five Indian languages. Our approach, focusing on zero/few-shot prompting of language models of varying sizes (from 4B parameters to large proprietary models), achieved Rank 4 in Telugu and Rank 2 in Hindi, with GLEU scores of 83.78 and 84.31 respectively. We extend the experiments to the other three languages of the shared task (Tamil, Malayalam, and Bangla) and take a closer look at the data quality and the evaluation metric used. Our results primarily highlight the potential of small language models, and summarize concerns about creating good-quality datasets and appropriate metrics for this task that are suitable for Indian language scripts.
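
A minimal sketch of how the GLEU metric cited above can be computed with NLTK, and of why tokenisation choice matters for Indian scripts; the Tamil sentences, the word/character tokenisation, and the use of NLTK's scorer are illustrative assumptions, not the shared task's official evaluation.

```python
# Hedged sketch: GLEU at word vs. character level with NLTK.
# The sentences and tokenisation choices are hypothetical examples.
from nltk.translate.gleu_score import sentence_gleu

reference = "அவள் பள்ளிக்குச் சென்றாள்"    # hypothetical gold correction
hypothesis = "அவள் பள்ளிக்கு சென்றாள்"      # hypothetical system output

# Word-level GLEU: a near-match word gets no credit at all.
word_gleu = sentence_gleu([reference.split()], hypothesis.split())

# Character-level GLEU: partial credit for near-identical words, which can
# matter for agglutinative scripts where a single sandhi character differs.
char_gleu = sentence_gleu([list(reference)], list(hypothesis))

print(f"word-level GLEU: {word_gleu:.3f}  char-level GLEU: {char_gleu:.3f}")
```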


Automated evaluation of children's speech fluency for low-resource languages

Zhang, Bowen, Latiff, Nur Afiqah Abdul, Kan, Justin, Tong, Rong, Soh, Donny, Miao, Xiaoxiao, McLoughlin, Ian

arXiv.org Artificial Intelligence

Assessment of children's speaking fluency in education is well researched for majority languages, but remains highly challenging for low-resource languages. This paper proposes a system to automatically assess fluency by combining a fine-tuned multilingual ASR model, an objective metrics extraction stage, and a generative pre-trained transformer (GPT) network. The objective metrics include phonetic and word error rates, speech rate, and speech-pause duration ratio. These are interpreted by a GPT-based classifier, guided by a small set of human-evaluated ground-truth examples, to score fluency. We evaluate the proposed system on a dataset of children's speech in two low-resource languages, Tamil and Malay, and compare the classification performance against Random Forest and XGBoost, as well as against using ChatGPT-4o to predict fluency directly from speech input. Results demonstrate that the proposed approach achieves significantly higher accuracy than multimodal GPT or other methods.
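
A minimal sketch of the objective-metric stage described above, assuming an ASR transcript and word-level timestamps are already available; the jiwer library, the example strings, and the timestamp values are assumptions, and the paper's GPT-based fluency classifier is not reproduced.

```python
# Hedged sketch: WER, speech rate, and speech-pause ratio from an ASR output.
import jiwer

reference_text = "the cat sat on the mat"     # hypothetical prompt text
asr_transcript = "the cat sat on mat"         # hypothetical ASR output
# (start, end) times in seconds for each recognised word -- hypothetical values
word_timestamps = [(0.0, 0.3), (0.4, 0.7), (0.9, 1.2), (1.8, 2.0), (2.1, 2.4)]

wer = jiwer.wer(reference_text, asr_transcript)

total_duration = word_timestamps[-1][1] - word_timestamps[0][0]
speech_time = sum(end - start for start, end in word_timestamps)
pause_time = total_duration - speech_time

speech_rate = len(word_timestamps) / total_duration            # words per second
speech_pause_ratio = speech_time / pause_time if pause_time else float("inf")

print(f"WER={wer:.2f}  rate={speech_rate:.2f} w/s  speech/pause={speech_pause_ratio:.2f}")
```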


MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking

Ramamoorthy, Sathyanarayanan, Shah, Vishwa, Khanuja, Simran, Sheikh, Zaid, Jie, Shan, Chia, Ann, Chua, Shearman, Neubig, Graham

arXiv.org Artificial Intelligence

This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks of multilingual and multimodal entity linking methods, exploring different language models such as LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. The dataset and methods for this work are available at https://github.com/rsathya4802/merlin
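
A schematic, hedged sketch of the late-fusion idea behind multimodal entity linking: a mention is scored against candidate Wikidata entities by combining text and image similarities. All embeddings, candidate QIDs, and the fusion weight are placeholders; the paper's actual encoders and linking method are not reproduced here.

```python
# Hedged sketch: score candidate entities by mixing text and image similarity.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
mention_text_vec = rng.normal(size=256)    # embedding of the mention in context (placeholder)
mention_image_vec = rng.normal(size=256)   # embedding of the paired news image (placeholder)

candidates = {                             # hypothetical Wikidata candidates
    "Q1930187 (journalist)": (rng.normal(size=256), rng.normal(size=256)),
    "Q82955 (politician)": (rng.normal(size=256), rng.normal(size=256)),
}

alpha = 0.7                                # assumed weight on the text signal
scores = {
    qid: alpha * cosine(mention_text_vec, tvec) + (1 - alpha) * cosine(mention_image_vec, ivec)
    for qid, (tvec, ivec) in candidates.items()
}
print(max(scores, key=scores.get))         # highest-scoring candidate entity
```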


LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators

Tan, Leanne, Chua, Gabriel, Ge, Ziyu, Lee, Roy Ka-Wei

arXiv.org Artificial Intelligence

Modern moderation systems increasingly support multiple languages, but often fail to address localisation and low-resource variants - creating safety gaps in real-world deployments. Small models offer a potential alternative to large LLMs, yet still demand considerable data and compute. We present LionGuard 2, a lightweight, multilingual moderation classifier tailored to the Singapore context, supporting English, Chinese, Malay, and partial Tamil. Built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, LionGuard 2 outperforms several commercial and open-source systems across 17 benchmarks, including both Singapore-specific and public English datasets. The system is actively deployed within the Singapore Government, demonstrating practical efficacy at scale. Our findings show that high-quality local data and robust multilingual embeddings can achieve strong moderation performance, without fine-tuning large models. We release our model weights and part of our training data to support future work on LLM safety.
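
A minimal sketch of a multi-head ordinal classifier over fixed text embeddings, in the spirit of the description above; the head names, hidden size, embedding dimension, and ordinal encoding (K-1 cumulative logits per category) are assumptions rather than the released LionGuard 2 architecture.

```python
# Hedged sketch: multi-head ordinal classification on top of precomputed embeddings.
import torch
import torch.nn as nn

class MultiHeadOrdinalClassifier(nn.Module):
    def __init__(self, embed_dim=1536, num_levels=4,
                 heads=("hate", "harassment", "sexual")):   # hypothetical categories
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU())
        # Each head emits K-1 logits; sigmoid(logit_k) ~ P(severity > k).
        self.heads = nn.ModuleDict(
            {name: nn.Linear(256, num_levels - 1) for name in heads}
        )

    def forward(self, embeddings):
        h = self.shared(embeddings)
        return {name: torch.sigmoid(head(h)) for name, head in self.heads.items()}

model = MultiHeadOrdinalClassifier()
fake_embedding = torch.randn(1, 1536)       # stand-in for a precomputed text embedding
probs = model(fake_embedding)
# Predicted severity per category = number of thresholds exceeded.
severity = {name: int((p > 0.5).sum()) for name, p in probs.items()}
print(severity)
```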


Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

Hu, Yujia, Hee, Ming Shan, Nakov, Preslav, Lee, Roy Ka-Wei

arXiv.org Artificial Intelligence

The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: conversation, question-answering, and content composition. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments. Link to the dataset: https://github.com/Social-AI-Studio/SGToxicGuard. Disclaimer: This paper contains sensitive content that may be disturbing to some readers.


Scaling, Simplification, and Adaptation: Lessons from Pretraining on Machine-Translated Text

Velasco, Dan John, Roque, Matthew Theodore

arXiv.org Artificial Intelligence

Most languages lack sufficient data for large-scale monolingual pretraining, creating a "data wall." Multilingual pretraining helps but is limited by language imbalance and the "curse of multilinguality." An alternative is to translate high-resource text with machine translation (MT), which raises three questions: (1) How does MT-derived data scale with model capacity? (2) Can source-side transformations (e.g., simplifying English with an LLM) improve generalization to native text? (3) How well do models pretrained on MT-derived data adapt when continually trained on limited native text? We investigate these questions by translating English into Indonesian and Tamil--two typologically distant, lower-resource languages--and pretraining GPT-2 models (124M-774M) on native or MT-derived corpora from raw and LLM-simplified English. We evaluate cross-entropy loss on native text, along with accuracy on syntactic probes and downstream tasks. Our results show that (1) MT-pretrained models benefit from scaling; (2) source-side simplification harms generalization to native text; and (3) adapting MT-pretrained models on native text often yields better performance than native-only models, even with less native data. However, tasks requiring cultural nuance (e.g., toxicity detection) demand more exposure to native data.
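
A minimal sketch of the native-text evaluation described above: measuring token-level cross-entropy of a GPT-2 checkpoint on a native-language sentence with Hugging Face transformers. The checkpoint name and the Indonesian example are placeholders for the paper's MT-pretrained models and evaluation corpora.

```python
# Hedged sketch: cross-entropy of a causal LM on native-language text.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_name = "gpt2"                          # stand-in for an MT-pretrained checkpoint
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

native_sentence = "Saya pergi ke pasar pagi ini."   # hypothetical Indonesian text
inputs = tokenizer(native_sentence, return_tensors="pt")

with torch.no_grad():
    # With labels supplied, the model returns the mean token-level cross-entropy.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"cross-entropy: {loss.item():.3f}  (perplexity ~ {loss.exp().item():.1f})")
```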


Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil

Jayatilleke, Nevidu, de Silva, Nisansa

arXiv.org Artificial Intelligence

Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.
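
A minimal sketch of character- and word-level OCR scoring of the kind reported above (CER and WER), using the jiwer library; the Tamil strings are hypothetical, and the paper's five measurement techniques and datasets are not reproduced here.

```python
# Hedged sketch: word and character error rates between ground truth and OCR output.
import jiwer

ground_truth = "இது ஒரு சோதனை வரி"      # hypothetical Tamil ground-truth line
ocr_output = "இது ஒரு சோதனை வாி"        # hypothetical OCR output with one character error

wer = jiwer.wer(ground_truth, ocr_output)   # word error rate
cer = jiwer.cer(ground_truth, ocr_output)   # character error rate

print(f"WER: {wer:.2%}  CER: {cer:.2%}")
```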


CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation

Zheng, Weihua, Lee, Roy Ka-Wei, Liu, Zhengyuan, Wu, Kui, Aw, AiTi, Zou, Bowei

arXiv.org Artificial Intelligence

Multilingual Large Language Models (MLLMs) demonstrate strong generalization across languages, yet they remain prone to hallucinations, especially in low-resource languages, due to training data imbalances. These hallucinations, which include inaccurate or fabricated outputs, are particularly problematic in domain-specific generation tasks (Chataigner et al., 2024). To address this challenge, we propose CCL-XCoT (Curriculum-based Contrastive Learning-based Cross-lingual Chain-of-Thought), a two-stage fine-tuning framework for mitigating hallucination in MLLMs. Our approach first enhances cross-lingual semantic alignment through curriculum-based contrastive learning combined with next-token prediction during continued pre-training. Building on this foundation, we then introduce a cross-lingual Chain-of-Thought (XCoT) prompting strategy during instruction fine-tuning, which guides the model to reason in a high-resource language before generating answers in the target low-resource language. Experimental results show that CCL-XCoT reduces hallucination rates by up to 62% and substantially improves factual knowledge transfer across language pairs, without relying on external retrieval or multi-model ensembles.
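
A minimal sketch of the cross-lingual Chain-of-Thought (XCoT) prompting idea described above: the model is asked to reason step by step in a high-resource language before answering in the low-resource target language. The template wording, the Tamil question, and the generate stub are assumptions, not the paper's exact prompts or fine-tuned models.

```python
# Hedged sketch: build an XCoT-style prompt (reason in English, answer in the target language).
def build_xcot_prompt(question: str, target_language: str) -> str:
    return (
        f"Question (in {target_language}): {question}\n\n"
        "First, think through the question step by step in English, "
        "using only facts you are confident about.\n"
        f"Then give the final answer in {target_language} only, "
        "prefixed with 'Answer:'."
    )

def generate(prompt: str) -> str:
    # Placeholder: call whichever multilingual LLM is being evaluated.
    raise NotImplementedError

prompt = build_xcot_prompt("தமிழ்நாட்டின் தலைநகரம் எது?", "Tamil")
print(prompt)
```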


SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning

Liu, Zhengyuan, Lin, Geyu, Tan, Hui Li, Zhang, Huayun, Lu, Yanfeng, Gao, Xiaoxue, Yin, Stella Xin, Sun, He, Goh, Hock Huan, Wong, Lung Hsiang, Chen, Nancy F.

arXiv.org Artificial Intelligence

The integration of generative artificial intelligence into educational applications has enhanced personalized and interactive learning experiences, and it shows strong potential to promote young learners' language acquisition. However, it is still challenging to ensure consistent and robust performance across different languages and cultural contexts, and kid-friendly design requires simplified instructions, engaging interactions, and age-appropriate scaffolding to maintain motivation and optimize learning outcomes. In this work, we introduce SingaKids, a dialogic tutor designed to facilitate language learning through picture description tasks. Our system integrates dense image captioning, multilingual dialogic interaction, speech understanding, and engaging speech generation to create an immersive learning environment in four languages: English, Mandarin, Malay, and Tamil. We further improve the system through multilingual pre-training, task-specific tuning, and scaffolding optimization. Empirical studies with elementary school students demonstrate that SingaKids provides effective dialogic teaching, benefiting learners at different performance levels.


Advancing Sentiment Analysis in Tamil-English Code-Mixed Texts: Challenges and Transformer-Based Solutions

Krasitskii, Mikhail, Kolesnikova, Olga, Hernandez, Liliana Chanona, Sidorov, Grigori, Gelbukh, Alexander

arXiv.org Artificial Intelligence

The sentiment analysis task in Tamil-English code-mixed texts has been explored using advanced transformer-based models. Challenges from grammatical inconsistencies, orthographic variations, and phonetic ambiguities have been addressed. The limitations of existing datasets and annotation gaps have been examined, emphasizing the need for larger and more diverse corpora. Transformer architectures, including XLM-RoBERTa, mT5, IndicBERT, and RemBERT, have been evaluated in low-resource, code-mixed environments. Performance metrics have been analyzed, highlighting the effectiveness of specific models in handling multilingual sentiment classification. The findings suggest that further advancements in data augmentation, phonetic normalization, and hybrid modeling approaches are required to enhance accuracy. Future research directions for improving sentiment analysis in code-mixed texts have been proposed.
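
A minimal sketch of running one of the transformer models named above (XLM-RoBERTa) on a Tamil-English code-mixed sentence via Hugging Face pipelines; the base checkpoint shown has no fine-tuned sentiment head, so a task-specific fine-tuned checkpoint is assumed before the labels are meaningful, and the example sentence is hypothetical.

```python
# Hedged sketch: classify a code-mixed sentence with an XLM-RoBERTa checkpoint.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="xlm-roberta-base",   # stand-in; a fine-tuned sentiment checkpoint is assumed
)

code_mixed_text = "Indha movie semma mass da, but climax konjam slow"
print(classifier(code_mixed_text))
```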