code-mixed sentence
CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
Yan, Brian, Hamed, Injy, Shimizu, Shuichiro, Lodagala, Vasista, Chen, William, Iakovenko, Olga, Talafha, Bashar, Hussein, Amir, Polok, Alexander, Chang, Kalvin, Klement, Dominik, Althubaiti, Sara, Peng, Puyuan, Wiesner, Matthew, Solorio, Thamar, Ali, Ahmed, Khudanpur, Sanjeev, Watanabe, Shinji, Chen, Chih-Chen, Wu, Zhen, Benharrak, Karim, Diwan, Anuj, Cornell, Samuele, Yeo, Eunjung, Choi, Kwanghee, Carvalho, Carlos, Rosero, Karen
CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with the generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research.
What talking you?: Translating Code-Mixed Messaging Texts to English
Ng, Lynnette Hui Xian, Chan, Luo Qi
Translation of code-mixed texts to formal English allow a wider audience to understand these code-mixed languages, and facilitate downstream analysis applications such as sentiment analysis. In this work, we look at translating Singlish, which is colloquial Singaporean English, to formal standard English. Singlish is formed through the code-mixing of multiple Asian languages and dialects. We analysed the presence of other Asian languages and variants which can facilitate translation. Our dataset is short message texts, written as informal communication between Singlish speakers. We use a multi-step prompting scheme on five Large Language Models (LLMs) for language detection and translation. Our analysis show that LLMs do not perform well in this task, and we describe the challenges involved in translation of code-mixed languages. We also release our dataset in this link https://github.com/luoqichan/singlish.
Code-Mixer Ya Nahi: Novel Approaches to Measuring Multilingual LLMs' Code-Mixing Capabilities
Gupta, Ayushman, Bhogal, Akhil, Ghosh, Kripabandhu
Multilingual Large Language Models (LLMs) have demonstrated exceptional performance in Machine Translation (MT) tasks. However, their MT abilities in the context of code-switching (the practice of mixing two or more languages in an utterance) remain under-explored. In this paper, we introduce Rule-Based Prompting, a novel prompting technique to generate code-mixed sentences. We measure and compare the code-mixed MT abilities of 3 popular multilingual LLMs: GPT-3.5-turbo, GPT-4, and Gemini Pro across five language pairs: English-{Hindi, Bengali, Gujarati, French, Spanish} using $k$-shot prompting ($k\in\{0, 1, 10, 20\}$) and Rule-Based Prompting. Our findings suggest that though $k$-shot prompting often leads to the best results, Rule-Based prompting shows promise in generating unique code-mixed sentences that vary in their style of code-mixing. We also use $k$-shot prompting to gauge the code-mixed to English translation abilities of multilingual LLMs. For this purpose, we create a gold-standard code-mixed dataset spanning five language pairs: English-{Hindi, Bengali, Gujarati, French, Spanish}. As a real-world application of our work, we create a code-mixed chatbot.
Multilingual Controlled Generation And Gold-Standard-Agnostic Evaluation of Code-Mixed Sentences
Gupta, Ayushman, Bhogal, Akhil, Ghosh, Kripabandhu
Code-mixing, the practice of alternating between two or more languages in an utterance, is a common phenomenon in multilingual communities. Due to the colloquial nature of code-mixing, there is no singular correct way to translate an English sentence into a code-mixed sentence. For this reason, standard n-gram-based MT evaluation metrics such as the BLEU score are not appropriate for code-mixed evaluation. To demonstrate this, we propose a novel method for code-mixed text generation: Controlled Generation, which parameterizes the code-mixing degree (CMD) and enables the generation of multiple semantically equivalent code-mixed sentences from a given English sentence. We introduce a robust new evaluation metric: GAME: A Gold-Standard Agnostic Measure for Evaluation of Code-Mixed Sentences. GAME is both language-agnostic and gold-standard-agnostic, i.e. unlike other metrics, GAME does not require gold-standard code-mixed sentences for evaluation, thus eliminating the need for human annotators in the code-mixed evaluation process. When used to evaluate semantically equivalent code-mixed sentences, we find that GAME scores have a lower standard deviation than BLEU scores. Further, we create and release a dataset containing gold-standard code-mixed sentences across 4 language pairs: English-{Hindi, Bengali, French, Spanish} to encourage more computational research on code-mixing.
From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences
Kodali, Prashant, Goel, Anmol, Asapu, Likhith, Bonagiri, Vamshi Krishna, Govil, Anirudh, Choudhury, Monojit, Shrivastava, Manish, Kumaraguru, Ponnurangam
Current computational approaches for analysing or generating code-mixed sentences do not explicitly model "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples sourced from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, Burstines, which are used to filter/curate/compare code-mixed corpora have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-Roberta and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT's zero and fewshot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgments using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mix corpus, and code for data generation and model training.
IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages
Adilazuarda, Muhammad Farid, Cahyawijaya, Samuel, Winata, Genta Indra, Fung, Pascale, Purwarianti, Ayu
In addition, we explore Processing (NLP) have introduced an immense methods to improve the robustness of LMs to improvement in many aspects, including code-mixed text. Using our IndoRobusta-Shot, standardized benchmarks (Wilie et al., 2020; we perform adversarial training to improve the Cahyawijaya et al., 2021; Koto et al., 2020; Winata code-mixed robustness of LMs. We explore three et al., 2022), large pre-trained language model kinds of tuning strategies: 1) code-mix only, 2) (LM) (Wilie et al., 2020; Cahyawijaya et al., 2021; two-steps, and 3) joint training, and empirically Koto et al., 2020), and resource expansion covering search for the best strategy to improve the model local Indonesian languages (Tri Apriani, 2016; robustness on code-mixed data.
Marathi-English Code-mixed Text Generation
Amin, Dhiraj, Govilkar, Sharvari, Kulkarni, Sagar, Lalit, Yash Shashikant, Khwaja, Arshi Ajaz, Xavier, Daries, Gupta, Sahil Girijashankar
Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings, yielding hybrid languages like Hinglish and Minglish. Marathi, India's third most spoken language, often integrates English for precision and formality. Developing code-mixed language systems, like Marathi-English (Minglish), faces resource constraints. This research introduces a Marathi-English code-mixed text generation algorithm, assessed with Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics. Across 2987 code-mixed questions, it achieved an average CMI of 0.2 and an average DCM of 7.4, indicating effective and comprehensible code-mixed sentences. These results offer potential for enhanced NLP tools, bridging linguistic gaps in multilingual societies.
Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
Yong, Zheng-Xin, Zhang, Ruochen, Forde, Jessica Zosa, Wang, Skyler, Subramonian, Arjun, Lovenia, Holy, Cahyawijaya, Samuel, Winata, Genta Indra, Sutawika, Lintang, Cruz, Jan Christian Blaise, Tan, Yin Lin, Phan, Long, Garcia, Rowena, Solorio, Thamar, Aji, Alham Fikri
While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural Singlish texts (an English-based creole spoken in Singapore), but for English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.