Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages

Sandhan, Jivnesh, Daksh, Ayush, Paranjay, Om Adideva, Behera, Laxmidhar, Goyal, Pawan

Sep-4-2022–arXiv.org Artificial Intelligence

Interest in code-mixing has become widespread in Natural Language Processing (NLP); however, little attention has been given to this phenomenon in the Speech Translation (ST) task. This can be attributed solely to the lack of labelled data for the code-mixed ST task. We therefore introduce Prabhupadavani, a multilingual code-mixed ST dataset for 25 languages. It is multi-domain, covers ten language families, and contains 94 hours of speech by 130+ speakers, manually aligned with the corresponding text in the target language. Prabhupadavani is about Vedic culture and heritage from Indic literature, where code-switching when quoting from literature is important in the context of humanities teaching. To the best of our knowledge, Prabhupadavani is the first multilingual code-mixed ST dataset available in the ST literature. The data can also be used for a code-mixed machine translation task. The full dataset can be accessed at https://github.com/frozentoad9/CMST.

artificial intelligence, machine translation, natural language

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Speech (1.00)
  - Natural Language > Machine Translation (1.00)
