MUTANT: A Multi-sentential Code-mixed Hinglish Dataset

Rahul Gupta, Vivek Srivastava, Mayank Singh

arXiv.org Artificial Intelligence 

Multi-sentential long-sequence textual data opens up several interesting research directions in natural language processing and generation. Though several high-quality long-sequence datasets exist for English and other monolingual languages, there has been no significant effort to build such resources for code-mixed languages such as Hinglish (code-mixing of Hindi and English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset, i.e., MUTANT. We propose a token-level language-aware pipeline, extend the existing metrics measuring the degree of code-mixing to a multi-sentential framework, and automatically identify MCT in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish ...

Figure 1: Example MCT and the corresponding article's title from two multilingual data sources: (A) Dainik Jagran news article and (B) Man-ki-baat speech transcript. We color code the tokens as: English, Hindi, and language independent.
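The abstract does not say which code-mixing metric is extended or how candidate spans are chosen. As a minimal sketch only, the snippet below assumes the Code-Mixing Index (CMI) of Das and Gambäck as the token-level measure and pools tags over a fixed window of consecutive sentences. The tag names (`en`, `hi`, `univ`), the window size, and the threshold are hypothetical choices for illustration, not values from the paper.

```python
from collections import Counter

# Hypothetical tag names: the source only says tokens are English,
# Hindi, or language independent.
EN, HI, OTHER = "en", "hi", "univ"


def code_mixing_index(tags):
    """CMI over token-level language tags: 0 for monolingual text,
    rising as the non-dominant language takes a larger share."""
    # Language-independent tokens are excluded from the denominator.
    counts = Counter(tag for tag in tags if tag != OTHER)
    language_tokens = sum(counts.values())
    if language_tokens == 0:
        return 0.0
    dominant = max(counts.values())
    return 100.0 * (1.0 - dominant / language_tokens)


def find_mct_candidates(tagged_sentences, window=2, threshold=30.0):
    """Return (start, end) sentence spans whose pooled CMI meets the
    threshold. Window and threshold are assumed, not from the paper."""
    spans = []
    for start in range(len(tagged_sentences) - window + 1):
        pooled = [
            tag
            for sentence in tagged_sentences[start:start + window]
            for tag in sentence
        ]
        if code_mixing_index(pooled) >= threshold:
            spans.append((start, start + window))
    return spans


# Usage: each inner list holds the language tags of one sentence.
article = [
    [HI, HI, HI],
    [HI, EN, HI, EN],
    [EN, HI, OTHER],
    [EN, EN],
]
candidates = find_mct_candidates(article, window=2, threshold=30.0)
```

Pooling tags across the window before scoring is one simple way to lift a sentence-level measure to several sentences. Other choices, such as averaging per-sentence scores, would weight short sentences differently.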
