AITopics | malay

Collaborating Authors

malay

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators

Tan, Leanne, Chua, Gabriel, Ge, Ziyu, Lee, Roy Ka-Wei

arXiv.org Artificial IntelligenceSep-30-2025

Modern moderation systems increasingly support multiple languages, but often fail to address localisation and low-resource variants - creating safety gaps in real-world deployments. Small models offer a potential alternative to large LLMs, yet still demand considerable data and compute. We present LionGuard 2, a lightweight, multilingual moderation classifier tailored to the Singapore context, supporting English, Chinese, Malay, and partial Tamil. Built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, LionGuard 2 outperforms several commercial and open-source systems across 17 benchmarks, including both Singapore-specific and public English datasets. The system is actively deployed within the Singapore Government, demonstrating practical efficacy at scale. Our findings show that high-quality local data and robust multilingual embeddings can achieve strong moderation performance, without fine-tuning large models. We release our model weights and part of our training data to support future work on LLM safety.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2507.15339

Country:

Asia > Singapore (1.00)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law (1.00)
Information Technology (0.93)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.36)

Add feedback

Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

Hu, Yujia, Hee, Ming Shan, Nakov, Preslav, Lee, Roy Ka-Wei

arXiv.org Artificial IntelligenceSep-24-2025

The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce \textsf{SGToxicGuard}, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: \textit{conversation}, \textit{question-answering}, and \textit{content composition}. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments.\footnote{Link to the dataset: https://github.com/Social-AI-Studio/SGToxicGuard.} \textcolor{red}{Disclaimer: This paper contains sensitive content that may be disturbing to some readers.}

computational linguistic, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2509.1526

Country:

North America > United States (1.00)
Europe (1.00)
Asia > Singapore (0.63)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government (0.68)
Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

fact check AI at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-checked Claim Retrieval

Rastogi, Pranshu

arXiv.org Artificial IntelligenceAug-6-2025

SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval is approached as a Learning-to-Rank task using a bi-encoder model fine-tuned from a pre-trained transformer optimized for sentence similarity. Training used both the source languages and their English translations for multilingual retrieval and only English translations for cross-lingual retrieval. Using lightweight models with fewer than 500M parameters and training on Kaggle T4 GPUs, the method achieved 92% Success@10 in multilingual and 80% Success@10 in 5th in crosslingual and 10th in multilingual tracks.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2508.03475

Country: Europe > Austria > Vienna (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)

Add feedback

CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation

Zheng, Weihua, Lee, Roy Ka-Wei, Liu, Zhengyuan, Wu, Kui, Aw, AiTi, Zou, Bowei

arXiv.org Artificial IntelligenceJul-22-2025

Multilingual Large Language Models(MLLMs) demonstrate strong generalization across languages, yet they remain prone to hallucinations, especially in low-resource languages, due to training data imbalances. These hallucinations, which include inaccurate or fabricated outputs, are particularly problematic in domain-specific generation tasks (Chataigner et al., 2024). To address this challenge, we propose CCL-XCoT(Curriculum-based Contrastive Learning-based Cross-lingual Chain-of-Thought), a two-stage fine-tuning framework for mitigating hallucination in MLLMs. Our approach first enhances cross-lingual semantic alignment through curriculum-based contrastive learning combined with next-token prediction during continued pre-training. Building on this foundation, we then introduce a cross-lingual Chain-of-Thought (XCoT) prompting strategy during instruction fine-tuning, which guides the model to reason in a high-resource language before generating answers in the target low-resource language. Experimental results show that CCL-XCoT reduces hallucination rates by up to 62% and substantially improves factual knowledge transfer across language pairs, without relying on external retrieval or multi-model ensembles.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2507.14239

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

Toxicity-Aware Few-Shot Prompting for Low-Resource Singlish Translation

Ge, Ziyu, Chua, Gabriel, Tan, Leanne, Lee, Roy Ka-Wei

arXiv.org Artificial IntelligenceJul-17-2025

As online communication increasingly incorporates under-represented languages and colloquial dialects, standard translation systems often fail to preserve local slang, code-mixing, and culturally embedded markers of harmful speech. Translating toxic content between low-resource language pairs poses additional challenges due to scarce parallel data and safety filters that sanitize offensive expressions. In this work, we propose a reproducible, two-stage framework for toxicity-preserving translation, demonstrated on a code-mixed Singlish safety corpus. First, we perform human-verified few-shot prompt engineering: we iteratively curate and rank annotator-selected Singlish-target examples to capture nuanced slang, tone, and toxicity. Second, we optimize model-prompt pairs by benchmarking several large language models using semantic similarity via direct and back-translation. Quantitative human evaluation confirms the effectiveness and efficiency of our pipeline. Beyond improving translation quality, our framework contributes to the safety of multicultural LLMs by supporting culturally sensitive moderation and benchmarking in low-resource contexts. By positioning Singlish as a testbed for inclusive NLP, we underscore the importance of preserving sociolinguistic nuance in real-world applications such as content moderation and regional platform governance.

large language model, machine learning, translation, (20 more...)

arXiv.org Artificial Intelligence

2507.11966

Country: Asia (0.15)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

Chua, Gabriel, Tan, Leanne, Ge, Ziyu, Lee, Roy Ka-Wei

arXiv.org Artificial IntelligenceJul-9-2025

Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages due to limited training data and evaluation benchmarks. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore's unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate - adversarial example generation by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label - semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; and (iii) Translate - high-fidelity translation preserving linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including the human-verified translations, and evaluation code are publicly available.

large language model, machine learning, translation, (20 more...)

arXiv.org Artificial Intelligence

2507.0598

Country:

North America (1.00)
Asia > Singapore (0.49)
Asia > Middle East > UAE (0.28)

Genre: Research Report (0.83)

Industry:

Law (1.00)
Information Technology (0.93)
Health & Medicine (0.70)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR

Nguyen, Tuan, Tran, Huy-Dat

arXiv.org Artificial IntelligenceJun-18-2025

--Developing code-switched ASR systems is challenging due to language ambiguity and limited exposure to multilingual, code-switched data, while collecting such speech is costly. Prior work generates synthetic audio from text, but these methods are computationally intensive and hard to scale. We introduce AsyncSwitch, a novel asynchronous adaptation framework that leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. Experiments with Whisper on Malay-English code-switching demonstrate a 9.02% relative WER reduction, while improving monolingual performance in Singlish, Malay, and other English variants. Code-switching--switching between languages within the same conversation--is a common and natural way of speaking in many multilingual communities.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.1419

Country:

Asia > Singapore (0.06)
Europe > Slovenia (0.04)
Asia > Southeast Asia (0.04)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Splintering Nonconcatenative Languages for Better Tokenization

Gazit, Bar, Shmidman, Shaltiel, Shmidman, Avi, Pinter, Yuval

arXiv.org Artificial IntelligenceMar-18-2025

Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone. This is not true for languages such as Hebrew and Arabic, where morphology is encoded in root-template patterns, or Malay and Georgian, where split affixes are common. We present SPLINTER, a pre-processing step which rearranges text into a linear form that better represents such nonconcatenative morphologies, enabling meaningful contiguous segments to be found by the tokenizer. We demonstrate SPLINTER's merit using both intrinsic measures evaluating token vocabularies in Hebrew, Arabic, and Malay; as well as on downstream tasks using BERT-architecture models trained for Hebrew.

artificial intelligence, machine translation, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.14433

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
(8 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

Add feedback

MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish

Huang, Xin, Vangani, Tarun Kumar, Pham, Minh Duc, Zou, Xunlong, Wang, Bin, Liu, Zhengyuan, Aw, Ai Ti

arXiv.org Artificial IntelligenceJan-16-2025

Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, efficacy can differ greatly between different language families, especially for those with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish. The initial released model is built on Llama-3-8B-Base and refined through a meticulously crafted process of continued pre-training and weight merging. Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models. We provide the model checkpoints as a resource to support further research and development in cross-lingual language understanding.

language model, meralion-textllm, singlish, (1 more...)

arXiv.org Artificial Intelligence

2501.08335

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.44)

Add feedback

Personal Intelligence System UniLM: Hybrid On-Device Small Language Model and Server-Based Large Language Model for Malay Nusantara

Nazri, Azree, Agbolade, Olalekan, Aziz, Faisal

arXiv.org Artificial IntelligenceOct-9-2024

In contexts with limited computational and data resources, high-resource language models often prove inadequate, particularly when addressing the specific needs of Malay languages. This paper introduces a Personal Intelligence System designed to efficiently integrate both on-device and server-based models. The system incorporates SLiM-34M for on-device processing, optimized for low memory and power usage, and MANYAK-1.3B for server-based tasks, allowing for scalable, high-performance language processing. The models achieve significant results across various tasks, such as machine translation, question-answering, and translate IndoMMLU. Particularly noteworthy is SLiM-34M's ability to achieve a high improvement in accuracy compared to other LLMs while using 2 times fewer pre-training tokens. This work challenges the prevailing assumption that large-scale computational resources are necessary to build effective language models, contributing to the development of resource-efficient models for the Malay language with the unique orchestration between SLiM-34M and MANYAK-1.3B.

arxiv preprint arxiv, language model, malay, (15 more...)

arXiv.org Artificial Intelligence

2410.06973

Country:

Asia > Indonesia > Borneo > Kalimantan > East Kalimantan > Nusantara (0.43)
Asia > Brunei (0.06)
Asia > Singapore (0.04)
(5 more...)

Genre:

Research Report (1.00)
Overview (0.93)

Industry:

Education (0.68)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback