
Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

Hu, Yujia, Hee, Ming Shan, Nakov, Preslav, Lee, Roy Ka-Wei

arXiv.org Artificial Intelligence

The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: conversation, question-answering, and content composition. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments. Link to the dataset: https://github.com/Social-AI-Studio/SGToxicGuard. Disclaimer: This paper contains sensitive content that may be disturbing to some readers.


Toxicity-Aware Few-Shot Prompting for Low-Resource Singlish Translation

Ge, Ziyu, Chua, Gabriel, Tan, Leanne, Lee, Roy Ka-Wei

arXiv.org Artificial Intelligence

As online communication increasingly incorporates under-represented languages and colloquial dialects, standard translation systems often fail to preserve local slang, code-mixing, and culturally embedded markers of harmful speech. Translating toxic content between low-resource language pairs poses additional challenges due to scarce parallel data and safety filters that sanitize offensive expressions. In this work, we propose a reproducible, two-stage framework for toxicity-preserving translation, demonstrated on a code-mixed Singlish safety corpus. First, we perform human-verified few-shot prompt engineering: we iteratively curate and rank annotator-selected Singlish-target examples to capture nuanced slang, tone, and toxicity. Second, we optimize model-prompt pairs by benchmarking several large language models using semantic similarity via direct and back-translation. Quantitative human evaluation confirms the effectiveness and efficiency of our pipeline. Beyond improving translation quality, our framework contributes to the safety of multicultural LLMs by supporting culturally sensitive moderation and benchmarking in low-resource contexts. By positioning Singlish as a testbed for inclusive NLP, we underscore the importance of preserving sociolinguistic nuance in real-world applications such as content moderation and regional platform governance.
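
The second stage above ranks model-prompt pairs by how well a back-translation recovers the source. A minimal sketch of that ranking idea, with token-overlap (Jaccard) similarity standing in for the semantic-similarity scoring the paper uses, and with the prompt names and sentences as illustrative assumptions:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap proxy for semantic similarity between sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def rank_prompts(source: str, back_translations: dict) -> list:
    """Rank prompt variants by how well their back-translations
    recover the original source sentence (higher overlap first)."""
    return sorted(back_translations,
                  key=lambda name: jaccard(source, back_translations[name]),
                  reverse=True)

src = "this one very sian already"
candidates = {
    "prompt_a": "this one very sian already",  # faithful round trip
    "prompt_b": "this is quite boring",        # slang and tone lost
}
print(rank_prompts(src, candidates))  # prompt_a ranks first
```

In the paper's pipeline the comparison is done with a semantic-similarity model over direct and back-translations; the set-overlap score here only illustrates the round-trip ranking mechanic.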


Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources

Sumanathilaka, Deshan, Perera, Sameera, Dharmasiri, Sachithya, Athukorala, Maneesha, Herath, Anuja Dilrukshi, Dias, Rukshan, Gamage, Pasindu, Weerasinghe, Ruvan, Priyadarshana, Y. H. P. P.

arXiv.org Artificial Intelligence

The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms developed for Romanized Sinhala to Sinhala transliteration between 2020 and 2025. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The current openly accessible data sets and corresponding tools are made publicly available through this hub. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.
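
A common baseline for romanized-script conversion of this kind is greedy longest-match substitution over a transliteration table. The two-entry map below is a toy assumption for illustration, not the hub's actual mapping resources:

```python
# Toy mapping: Sinhala consonants carry an inherent 'a' vowel,
# so "ka" and "ma" map to single letters.
TOY_MAP = {"ka": "ක", "ma": "ම"}

def transliterate(romanized: str, mapping=TOY_MAP) -> str:
    """Greedy longest-match transliteration; unmapped characters
    pass through unchanged."""
    out, i = [], 0
    keys = sorted(mapping, key=len, reverse=True)  # longest match first
    while i < len(romanized):
        for k in keys:
            if romanized.startswith(k, i):
                out.append(mapping[k])
                i += len(k)
                break
        else:
            out.append(romanized[i])
            i += 1
    return "".join(out)

print(transliterate("kama"))  # → කම
```

Real Romanized Sinhala transliteration must also handle vowel signs, consonant clusters, and ambiguous spellings, which is why the hub's trained models go well beyond a lookup table like this.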


RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

Chua, Gabriel, Tan, Leanne, Ge, Ziyu, Lee, Roy Ka-Wei

arXiv.org Artificial Intelligence

Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages due to limited training data and evaluation benchmarks. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore's unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate - adversarial example generation by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label - semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; and (iii) Translate - high-fidelity translation preserving linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including the human-verified translations, and evaluation code are publicly available.
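
The "Label" stage aggregates multi-label judgments from several LLM labelers by majority vote. A minimal sketch of that aggregation, where the category names and the 2-of-3 threshold are illustrative assumptions rather than the benchmark's exact configuration:

```python
from collections import Counter

def majority_labels(annotations, threshold=2):
    """Keep each safety label assigned by at least `threshold` labelers."""
    counts = Counter(label for labels in annotations for label in labels)
    return sorted(l for l, c in counts.items() if c >= threshold)

votes = [
    {"hate", "insult"},  # labeler A
    {"hate"},            # labeler B
    {"hate", "insult"},  # labeler C
]
print(majority_labels(votes))  # → ['hate', 'insult']
```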


AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR

Nguyen, Tuan, Tran, Huy-Dat

arXiv.org Artificial Intelligence

Code-switching (switching between languages within the same conversation) is a common and natural way of speaking in many multilingual communities. Developing code-switched ASR systems is challenging due to language ambiguity and limited exposure to multilingual, code-switched data, while collecting such speech is costly. Prior work generates synthetic audio from text, but these methods are computationally intensive and hard to scale. We introduce AsyncSwitch, a novel asynchronous adaptation framework that leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. Experiments with Whisper on Malay-English code-switching demonstrate a 9.02% relative WER reduction, while improving monolingual performance in Singlish, Malay, and other English variants.
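
The reported 9.02% figure is a relative reduction in word error rate (WER). As a reminder of what is being measured, a minimal WER implementation via word-level edit distance, with made-up example sentences:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

def relative_reduction(baseline: float, improved: float) -> float:
    """Relative WER reduction, as reported in ASR papers."""
    return (baseline - improved) / baseline

print(wer("saya nak makan now", "saya makan now"))  # 1 deletion / 4 words = 0.25
```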


Enhancing Multilingual Sentiment Analysis with Explainability for Sinhala, English, and Code-Mixed Content

Rizvi, Azmarah, Thamindu, Navojith, Adhikari, A. M. N. H., Senevirathna, W. P. U., Kasthurirathna, Dharshana, Abeywardhana, Lakmini

arXiv.org Artificial Intelligence

Sentiment analysis is crucial for brand reputation management in the banking sector, where customer feedback spans English, Sinhala, Singlish, and code-mixed text. Existing models struggle with low-resource languages like Sinhala and lack interpretability for practical use. This research develops a hybrid aspect-based sentiment analysis framework that enhances multilingual capabilities with explainable outputs. Using cleaned banking customer reviews, we fine-tune XLM-RoBERTa for Sinhala and code-mixed text, integrate domain-specific lexicon correction, and employ BERT-base-uncased for English. The system classifies sentiment (positive, neutral, negative) with confidence scores, while SHAP and LIME improve interpretability by providing real-time sentiment explanations. Experimental results show that our approaches outperform traditional transformer-based classifiers, achieving 92.3 percent accuracy and an F1-score of 0.89 in English and 88.4 percent in Sinhala and code-mixed content. An explainability analysis reveals key sentiment drivers, improving trust and transparency. A user-friendly interface delivers aspect-wise sentiment insights, ensuring accessibility for businesses. This research contributes to robust, transparent sentiment analysis for financial applications by bridging gaps in multilingual, low-resource NLP and explainability.
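
The system's three-way output with a confidence score is the standard softmax-over-logits pattern. A sketch of that final step, where the fixed logit vector stands in for the output of the fine-tuned XLM-RoBERTa head:

```python
import math

LABELS = ["negative", "neutral", "positive"]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify(logits):
    """Return (label, confidence) for the highest-probability class."""
    probs = softmax(logits)
    i = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[i], probs[i]

label, conf = classify([-1.2, 0.3, 2.1])  # toy logits
print(label, round(conf, 2))
```

SHAP and LIME then attribute that confidence back to input tokens; this sketch covers only the classification-with-confidence part.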


Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages -- A Singlish Case Study

Lim, Isaac, Khoo, Shaun, Chua, Watson, Jiayi, Goh, Foo, Jessica

arXiv.org Artificial Intelligence

To ensure safe usage, Large Language Models (LLMs) typically undergo alignment with human-defined values. However, this alignment often relies on primarily English data and is biased towards Western-centric values, limiting its effectiveness in low-resource language settings. In this paper, we describe our approach for aligning SEA-Lion-v2.1-Instruct (a Llama3-8B variant) to minimize toxicity in Singlish, an English creole specific to Singapore. We find that supervised fine-tuning and Kahneman-Tversky Optimization (KTO) on paired and unpaired preferences is more sample efficient and yields significantly better results than Direct Preference Optimization (DPO). Our analysis reveals that DPO implicitly enforces a weaker safety objective than KTO, and that SFT complements KTO by improving training stability. Finally, we introduce a simple but novel modification to KTO, KTO-S, which improves training stability through better gradient exploitation. Overall, we present a general approach for safety alignment conducive to low-resource English languages, successfully reducing toxicity by 99% on our Singlish benchmark, with gains generalizing to the broader TOXIGEN dataset while maintaining strong performance across standard LLM benchmarks.
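
For readers unfamiliar with the DPO baseline the paper compares against, its per-example loss is the negative log-sigmoid of a scaled margin between chosen and rejected log-probability ratios. A minimal sketch with toy log-probabilities standing in for the policy and reference models (this illustrates DPO only, not the paper's KTO-S modification):

```python
import math

def dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """DPO loss: -log sigmoid(beta * margin), where the margin is the
    chosen log-ratio minus the rejected log-ratio vs. the reference."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy prefers the safe (chosen) response more than the reference
# does, so the loss falls below the zero-margin value of log(2):
print(round(dpo_loss(-5.0, -6.0, -9.0, -7.0), 4))
```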


MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish

Huang, Xin, Vangani, Tarun Kumar, Pham, Minh Duc, Zou, Xunlong, Wang, Bin, Liu, Zhengyuan, Aw, Ai Ti

arXiv.org Artificial Intelligence

Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, efficacy can differ greatly between different language families, especially for those with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish. The initial released model is built on Llama-3-8B-Base and refined through a meticulously crafted process of continued pre-training and weight merging. Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models. We provide the model checkpoints as a resource to support further research and development in cross-lingual language understanding.
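
Weight merging, one ingredient of the refinement process described above, is in its simplest form a linear interpolation of corresponding parameters from two checkpoints. A sketch under that assumption, with plain lists standing in for tensors; the 0.5 ratio is illustrative, not the recipe used for MERaLiON-TextLLM:

```python
def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linearly interpolate each parameter between two checkpoints:
    alpha * sd_a + (1 - alpha) * sd_b."""
    assert sd_a.keys() == sd_b.keys()
    return {name: [alpha * a + (1 - alpha) * b
                   for a, b in zip(sd_a[name], sd_b[name])]
            for name in sd_a}

base = {"layer.weight": [1.0, 2.0]}
tuned = {"layer.weight": [3.0, 4.0]}
print(merge_state_dicts(base, tuned))  # averages each parameter
```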


Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models

Wang, Bin, Zou, Xunlong, Sun, Shuo, Zhang, Wenyu, He, Yingxu, Liu, Zhuohan, Wei, Chengwei, Chen, Nancy F., Aw, AiTi

arXiv.org Artificial Intelligence

Speech technologies have evolved over decades, progressing from modularized solutions for speech recognition (Povey et al., 2011; Radford et al., 2023), speaker identification (Togneri and Pullella, 2011), and gender recognition (Hechmi et al., 2021), with modularized toolkits like Kaldi (Povey et al., 2011) and ESPnet (Watanabe et al., 2018), to advanced solutions integrating large language models for multimodal understanding in an all-encompassing, omni-style approach (Team et al.). Existing Singlish spoken corpora have primarily focused on linguistic analysis and speech recognition tasks (Deterding and Low, 2001; Chen et al., 2010; Lyu et al., 2010; Tan, 2019). Given the relatively small population of Singlish speakers, estimated at just a few million, resources for Singlish speech corpora are significantly more limited compared to major languages like English, Chinese, French, and Spanish. Singapore's government agency, IMDA, has open-sourced the largest available Singlish corpus, known as the National Speech Corpus (Koh et al., 2019).


Limpeh ga li gong: Challenges in Singlish Annotations

Chan, Luo Qi, Ng, Lynnette Hui Xian

arXiv.org Artificial Intelligence

Singlish, or Colloquial Singapore English, is a language formed from oral and social communication within multicultural Singapore. In this work, we work on a fundamental Natural Language Processing (NLP) task: Parts-Of-Speech (POS) tagging of Singlish sentences. For our analysis, we build a parallel Singlish dataset containing direct English translations and POS tags, with translation and POS annotation done by native Singlish speakers. Our experiments show that automatic transition- and transformer-based taggers perform with only ~80% accuracy when evaluated against human-annotated POS labels, suggesting that there is indeed room for improvement on computational analysis of the language. We provide an exposition of challenges in Singlish annotation: its inconsistencies in form and semantics, the highly context-dependent particles of the language, its structurally unique expressions, and the variation of the language across different mediums. Our task definition, resultant labels, and results reflect the challenges in analysing colloquial languages formulated from a variety of dialects, and pave the way for future studies beyond POS tagging.
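
The ~80% figure is token-level tagging accuracy against the human-annotated labels. A minimal sketch of that evaluation, with toy gold and predicted tag sequences (the sentence and the tagger's error are illustrative assumptions):

```python
def pos_accuracy(gold, predicted):
    """Fraction of tokens whose predicted POS tag matches the gold tag."""
    assert len(gold) == len(predicted)
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

gold = ["PRON", "VERB", "ADV", "PART"]  # e.g. "he eat already lah"
pred = ["PRON", "VERB", "ADV", "INTJ"]  # tagger misreads the particle
print(pos_accuracy(gold, pred))  # → 0.75
```

Context-dependent particles like "lah" are exactly the tokens where such taggers tend to disagree with native-speaker annotation, which is what drags automatic accuracy down.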