Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus
Churina, Svetlana, Gupta, Akshat, Mujtahid, Insyirah, Jaidka, Kokil
arXiv.org Artificial Intelligence
Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. Our live project will continuously gather, verify, and integrate code-mixed messages into a structured dataset released in JSON format, accompanied by detailed metadata and linguistic statistics. To date, it includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications. Code and dataset sample can be found here.
Jun-17-2025
- Country:
- Asia
- East Asia (0.04)
- India > Tamil Nadu
- Chennai (0.04)
- Indonesia > Sulawesi
- South Sulawesi (0.04)
- Singapore > Central Region
- Singapore (0.04)
- Southeast Asia (0.05)
- Thailand > Bangkok
- Bangkok (0.04)
- Europe
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Iceland > Capital Region
- Reykjavik (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- North America
- Mexico > Mexico City
- Mexico City (0.04)
- United States > Texas
- Travis County > Austin (0.04)
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Information Technology > Services (0.93)
- Technology: