Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus

Churina, Svetlana, Gupta, Akshat, Mujtahid, Insyirah, Jaidka, Kokil

Jun-17-2025–arXiv.org Artificial Intelligence

Code-mixing involves the seamless integration of linguistic elements from multiple languages within a single discourse, reflecting natural multilingual communication patterns. Despite its prominence in informal interactions such as social media, chat messages and instant-messaging exchanges, there has been a lack of publicly available corpora that are author-labeled and suitable for modeling human conversations and relationships. This study introduces the first labeled and general-purpose corpus for understanding code-mixing in context while maintaining rigorous privacy and ethical standards. Our live project will continuously gather, verify, and integrate code-mixed messages into a structured dataset released in JSON format, accompanied by detailed metadata and linguistic statistics. To date, it includes over 355,641 messages spanning various code-mixing patterns, with a primary focus on English, Mandarin, and other languages. We expect the Codemix Corpus to serve as a foundational dataset for research in computational linguistics, sociolinguistics, and NLP applications. Code and dataset sample can be found here.
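As a rough illustration of what working with such JSON-formatted messages might look like, the sketch below flags messages that contain both Latin-script and Chinese characters. The record layout (a list of objects with a "text" field) and the helper names scripts_in and is_code_mixed are assumptions made for this example and are not taken from the corpus; the script-based heuristic is a stand-in, not the authors' labeling method.

```python
import json
import unicodedata


def scripts_in(text):
    """Return the coarse script labels ('latin', 'han') present in text."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, emoji
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            found.add("han")
        elif name.startswith("LATIN"):
            found.add("latin")
    return found


def is_code_mixed(text):
    """Heuristic: a message is mixed if it uses both Latin and Han script."""
    return {"latin", "han"} <= scripts_in(text)


# Hypothetical records; the real corpus schema may differ.
sample = json.loads(
    '[{"text": "ok lah 我们走吧"}, {"text": "see you tomorrow"}, {"text": "你好"}]'
)
mixed = [m["text"] for m in sample if is_code_mixed(m["text"])]
```

A script-level check like this only catches mixing across writing systems; it would miss romanized Mandarin or mixing between two Latin-script languages, which is one reason author-provided labels are valuable.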

artificial intelligence, large language model, natural language

Country:
- Europe (0.68)
- North America > Mexico (0.28)
- Asia > Indonesia (0.28)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Information Technology > Services (0.93)

Technology:
- Information Technology
  - Communications > Social Media (1.00)
  - Artificial Intelligence > Natural Language
    - Large Language Model (0.68)
    - Text Processing (0.67)
