AITopics | code-mixed data

Collaborating Authors

code-mixed data

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Comparative Study of Pre-Trained BERT and Large Language Models for Code-Mixed Named Entity Recognition

Shirke, Mayur, Shembade, Amey, Thorat, Pavan, Wagh, Madhushri, Joshi, Raviraj

arXiv.org Artificial IntelligenceSep-3-2025

Named Entity Recognition (NER) in code-mixed text, particularly Hindi-English (Hinglish), presents unique challenges due to informal structure, transliteration, and frequent language switching. This study conducts a comparative evaluation of code-mixed fine-tuned models and non-code-mixed multilingual models, along with zero-shot generative large language models (LLMs). Specifically, we evaluate HingBERT, HingM-BERT, and HingRoBERTa (trained on code-mixed data), and BERT Base Cased, IndicBERT, RoBERTa and MuRIL (trained on non-code-mixed multilingual data). We also assess the performance of Google Gemini in a zero-shot setting using a modified version of the dataset with NER tags removed. All models are tested on a benchmark Hinglish NER dataset using Precision, Recall, and F1-score. Results show that code-mixed models, particularly HingRoBERTa and HingBERT -based fine-tuned models, outperform others -- including closed-source LLMs like Google Gemini -- due to domain-specific pretraining. Non-code-mixed models perform reasonably but show limited adaptability. Notably, Google Gemini exhibits competitive zero-shot performance, underlining the generalization strength of modern LLMs. This study provides key insights into the effectiveness of specialized versus generalized models for code-mixed NER tasks.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.02514

Country:

Asia > India > Maharashtra > Pune (0.14)
Asia > Indonesia > Bali (0.05)
Asia > India > Tamil Nadu > Chennai (0.04)

Genre: Research Report > New Finding (0.89)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

On Importance of Code-Mixed Embeddings for Hate Speech Identification

Jagdale, Shruti, Khade, Omkar, Takalikar, Gauri, Inamdar, Mihir, Joshi, Raviraj

arXiv.org Artificial IntelligenceNov-27-2024

Code-mixing is the practice of using two or more languages in a single sentence, which often occurs in multilingual communities such as India where people commonly speak multiple languages. Classic NLP tools, trained on monolingual data, face challenges when dealing with code-mixed data. Extracting meaningful information from sentences containing multiple languages becomes difficult, particularly in tasks like hate speech detection, due to linguistic variation, cultural nuances, and data sparsity. To address this, we aim to analyze the significance of code-mixed embeddings and evaluate the performance of BERT and HingBERT models (trained on a Hindi-English corpus) in hate speech detection. Our study demonstrates that HingBERT models, benefiting from training on the extensive Hindi-English dataset L3Cube-HingCorpus, outperform BERT models when tested on hate speech text datasets. We also found that code-mixed Hing-FastText performs better than standard English FastText and vanilla BERT models.

dataset, embedding, hingbert model, (13 more...)

arXiv.org Artificial Intelligence

2411.18577

Country:

North America > United States > California > Orange County > Newport Beach (0.05)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
Asia > India > Maharashtra > Pune (0.05)
(5 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi Emotion Detection

Raihan, Nishat, Goswami, Dhiman, Mahmud, Antara, Anastasopoulos, Antonios, Zampieri, Marcos

arXiv.org Artificial IntelligenceMay-11-2024

Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce EmoMix-3L, a novel multi-label emotion detection dataset containing code-mixed data from three different languages. We experiment with several models on EmoMix-3L and we report that MuRIL outperforms other models on this dataset.

dataset, emomix-3l, proceedings, (11 more...)

arXiv.org Artificial Intelligence

2405.06922

Country:

North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Singapore (0.04)
Asia > India > West Bengal (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Add feedback

Marathi-English Code-mixed Text Generation

Amin, Dhiraj, Govilkar, Sharvari, Kulkarni, Sagar, Lalit, Yash Shashikant, Khwaja, Arshi Ajaz, Xavier, Daries, Gupta, Sahil Girijashankar

arXiv.org Artificial IntelligenceSep-28-2023

Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings, yielding hybrid languages like Hinglish and Minglish. Marathi, India's third most spoken language, often integrates English for precision and formality. Developing code-mixed language systems, like Marathi-English (Minglish), faces resource constraints. This research introduces a Marathi-English code-mixed text generation algorithm, assessed with Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics. Across 2987 code-mixed questions, it achieved an average CMI of 0.2 and an average DCM of 7.4, indicating effective and comprehensible code-mixed sentences. These results offer potential for enhanced NLP tools, bridging linguistic gaps in multilingual societies.

code-mixed sentence, code-mixed text, dataset, (16 more...)

arXiv.org Artificial Intelligence

2309.16202

Country:

Africa > Middle East > Egypt > Giza Governorate > Giza (0.04)
Asia > India > Maharashtra > Mumbai (0.04)

Genre: Research Report (0.40)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Add feedback

Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

Yong, Zheng-Xin, Zhang, Ruochen, Forde, Jessica Zosa, Wang, Skyler, Subramonian, Arjun, Lovenia, Holy, Cahyawijaya, Samuel, Winata, Genta Indra, Sutawika, Lintang, Cruz, Jan Christian Blaise, Tan, Yin Lin, Phan, Long, Garcia, Rowena, Solorio, Thamar, Aji, Alham Fikri

arXiv.org Artificial IntelligenceSep-12-2023

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural Singlish texts (an English-based creole spoken in Singapore), but for English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.

chatgpt, code-mixed data, computational linguistic, (16 more...)

arXiv.org Artificial Intelligence

2303.13592

Country:

Asia > East Asia (0.24)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
Asia > Philippines (0.04)
(23 more...)

Genre:

Research Report (1.00)
Personal > Interview (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)

Add feedback

Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts

Tonja, Atnafu Lambebo, Yigezu, Mesay Gemeda, Kolesnikova, Olga, Tash, Moein Shahiki, Sidorov, Grigori, Gelbuk, Alexander

arXiv.org Artificial IntelligenceNov-25-2022

Using code-mixed data in natural language processing (NLP) research currently gets a lot of attention. Language identification of social media code-mixed text has been an interesting problem of study in recent years due to the advancement and influences of social media in communication. This paper presents the Instituto Polit\'ecnico Nacional, Centro de Investigaci\'on en Computaci\'on (CIC) team's system description paper for the CoLI-Kanglish shared task at ICON2022. In this paper, we propose the use of a Transformer based model for word-level language identification in code-mixed Kannada English texts. The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.

information retrieval, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2211.14459

Country:

Africa (0.05)
South America (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.79)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.56)

Add feedback

Domain Curricula for Code-Switched MT at MixMT 2022

Raheem, Lekan, Elrashid, Maab

arXiv.org Artificial IntelligenceOct-31-2022

In multilingual colloquial settings, it is a habitual occurrence to compose expressions of text or speech containing tokens or phrases of different languages, a phenomenon popularly known as code-switching or code-mixing (CMX). We present our approach and results for the Code-mixed Machine Translation (MixMT) shared task at WMT 2022: the task consists of two subtasks, monolingual to code-mixed machine translation (Subtask-1) and code-mixed to monolingual machine translation (Subtask-2). Most non-synthetic code-mixed data are from social media but gathering a significant amount of this kind of data would be laborious and this form of data has more writing variation than other domains, so for both subtasks, we experimented with data schedules for out-of-domain data. We jointly learn multiple domains of text by pretraining and fine-tuning, combined with a sentence alignment objective. We found that switching between domains caused improved performance in the domains seen earliest during training, but depleted the performance on the remaining domains. A continuous training run with strategically dispensed data of different domains showed a significantly improved performance over fine-tuning.

machine learning, natural language, translation, (19 more...)

arXiv.org Artificial Intelligence

2210.17463

Country:

Europe > Denmark > Capital Region > Copenhagen (0.04)
North America > Canada > Ontario (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
(4 more...)

Genre:

Instructional Material > Course Syllabus & Notes (0.64)
Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

ExCode-Mixed: Explainable Approaches towards Sentiment Analysis on Code-Mixed Data using BERT models

Priyanshu, Aman, Vardhan, Aleti, Sivakumar, Sudarshan, Vijay, Supriti, Chhabra, Nipuna

arXiv.org Artificial IntelligenceSep-25-2021

The increasing use of social media sites in countries like India has given rise to large volumes of code-mixed data. Sentiment analysis of this data can provide integral insights into people's perspectives and opinions. Developing robust explainability techniques which explain why models make their predictions becomes essential. In this paper, we propose an adequate methodology to integrate explainable approaches into code-mixed sentiment analysis.

explanation, prediction, sentiment analysis, (14 more...)

arXiv.org Artificial Intelligence

2109.032

Country:

Asia > India (0.25)
Asia > Indonesia > Bali (0.05)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.05)

Genre: Research Report (0.84)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.97)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.97)

Add feedback

NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Code-Mixed Dravidian text using XLNet

Banerjee, Shubhanker, Jayapal, Arun, Thavareesan, Sajeetha

arXiv.org Artificial IntelligenceOct-15-2020

Social media has penetrated into multilingual societies, however most of them use English to be a preferred language for communication. So it looks natural for them to mix their cultural language with English during conversations resulting in abundance of multilingual data, call this code-mixed data, available in todays' world.Downstream NLP tasks using such data is challenging due to the semantic nature of it being spread across multiple languages.One such Natural Language Processing task is sentiment analysis, for this we use an auto-regressive XLNet model to perform sentiment analysis on code-mixed Tamil-English and Malayalam-English datasets.

large language model, machine learning, sentiment analysis, (18 more...)

arXiv.org Artificial Intelligence

2010.07773

Country:

Europe > Ireland > Leinster > County Dublin > Dublin (0.15)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
Europe > Ireland > Connaught > County Galway > Galway (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.90)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.90)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

WESSA at SemEval-2020 Task 9: Code-Mixed Sentiment Analysis using Transformers

Sultan, Ahmed, Salim, Mahmoud, Gaber, Amina, Hosary, Islam El

arXiv.org Artificial IntelligenceSep-21-2020

In this paper, we describe our system submitted for SemEval 2020 Task 9, Sentiment Analysis for Code-Mixed Social Media Text alongside other experiments. Our best performing system is a Transfer Learning-based model that fine-tunes "XLM-RoBERTa", a transformer-based multilingual masked language model, on monolingual English and Spanish data and Spanish-English code-mixed data. Our system outperforms the official task baseline by achieving a 70.1% average F1-Score on the official leaderboard using the test set. For later submissions, our system manages to achieve a 75.9% average F1-Score on the test set using CodaLab username "ahmed0sultan".

machine learning, natural language, sentiment analysis, (14 more...)

arXiv.org Artificial Intelligence

2009.09879

Country:

Asia > Indonesia > Bali (0.04)
Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.91)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.77)

Add feedback