MUTANT: A Multi-sentential Code-mixed Hinglish Dataset

Gupta, Rahul, Srivastava, Vivek, Singh, Mayank

Feb-22-2023–arXiv.org Artificial Intelligence

Multi-sentential, long-sequence text opens up several interesting research directions in natural language processing and generation. Although several high-quality long-sequence datasets exist for English and other monolingual settings, there has been no significant effort to build such resources for code-mixed languages such as Hinglish (code-mixing of Hindi and English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset, i.e., MUTANT. We propose a token-level language-aware pipeline, extend the existing metrics measuring the degree of code-mixing to a multi-sentential framework, and automatically identify MCT in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish ...

Figure 1: Example MCT and the corresponding article's title from two multilingual data sources: (A) a Dainik Jagran news article and (B) a Man-ki-baat speech transcript. We color code the tokens as: English, Hindi, and language independent.
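The abstract mentions extending existing code-mixing metrics to a multi-sentential framework but does not name them here. As a rough illustration only, the sketch below computes the Code-Mixing Index (CMI) of Das and Gambäck, one widely used metric of this kind, from token-level language tags. The tag names ("hi", "en", "other"), the function names, and the mean-over-sentences aggregation are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

# Assumed tag for language-independent tokens (punctuation, numbers, names).
OTHER = "other"


def cmi(tags):
    """Code-Mixing Index for one sequence of token-level language tags.

    CMI = 100 * (1 - max_i(w_i) / (n - u)), where w_i is the number of
    tokens in language i, n is the total number of tokens, and u is the
    number of language-independent tokens. Returns 0.0 when n == u.
    """
    lang_counts = Counter(t for t in tags if t != OTHER)
    tagged = sum(lang_counts.values())  # this is n - u
    if tagged == 0:
        return 0.0
    return 100.0 * (1.0 - max(lang_counts.values()) / tagged)


def multi_sentence_cmi(sentences):
    """Hypothetical multi-sentence aggregation: mean of per-sentence CMI.

    This is an illustrative choice, not the extension proposed in the paper.
    """
    if not sentences:
        return 0.0
    return sum(cmi(s) for s in sentences) / len(sentences)


if __name__ == "__main__":
    passage = [
        ["hi", "en", "hi", "en", OTHER],  # evenly mixed sentence
        ["hi", "hi", "hi", OTHER],        # monolingual Hindi sentence
    ]
    print(multi_sentence_cmi(passage))
```

A score of 0 means a sequence uses a single language; higher values indicate heavier mixing. Averaging per-sentence scores is only one way to aggregate; a passage-level metric could also weight sentences by length or count language switches across sentence boundaries.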

artificial intelligence, dataset, natural language


Country:
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > India
  - Gujarat > Gandhinagar (0.05)
  - Maharashtra > Pune (0.04)

Genre:
- Research Report (1.00)

Industry:
- Government (0.68)

Technology:
- Information Technology > Artificial Intelligence > Natural Language (1.00)
