
Collaborating Authors

 Umesh, S


SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, IIT Madras

arXiv.org Artificial Intelligence

India is home to a multitude of languages, of which 22 are recognised by the Indian Constitution as official. Building speech-based applications for the Indian population is a difficult problem owing to limited data and the number of languages and accents to accommodate. To encourage the language technology community to build speech-based applications in Indian languages, we are open sourcing SPRING-INX data, which has about 2000 hours of legally sourced and manually transcribed speech data for ASR system building in Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi and Tamil. This endeavour is by SPRING Lab, Indian Institute of Technology Madras, and is a part of the National Language Translation Mission (NLTM), funded by the Indian Ministry of Electronics and Information Technology (MeitY), Government of India. We describe the data collection and data cleaning process along with the data statistics in this paper.

To increase the internet content of Indian languages in different domains, and as part of the Speech Consortium of the NLTM-R&D led by Indian Institute of Technology Madras (IITM), SPRING Lab of IITM has collected and is collecting legally sourced and manually transcribed speech corpora in various Indian languages such as Tamil, Hindi, Indian English, Marathi, Bengali, Malayalam, Telugu, Assamese, Kannada, Gujarati, Odia, Punjabi, Bodo and Manipuri through speech data collection agencies identified using a tendering process. The data collected has been carefully evaluated by the Speech Quality Control (SQC) team led by KL University. We are releasing the first set of valuable data, amounting to 2000 hours (both audio and the corresponding manual transcriptions), which was collected, cleaned and prepared for ASR system building in 10 Indian languages.


SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis

arXiv.org Artificial Intelligence

While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditional inputs, it still leaves scope for richer representations. In this work, we leverage representations from various Self-Supervised Learning (SSL) models to enhance the quality of the synthesized speech. In particular, we pass the FastSpeech2 encoder's length-regulated outputs through a series of encoder layers with the objective of reconstructing the SSL representations. In the SALTTS-parallel implementation, the representations from this second encoder are used only for an auxiliary reconstruction loss against the SSL features. The SALTTS-cascade implementation, however, also passes these representations through the decoder in addition to computing the reconstruction loss. The richness of speech characteristics from the SSL features carries over into the output speech quality, with both objective and subjective evaluation measures for the proposed approach outperforming the baseline FastSpeech2.
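
The following is a minimal PyTorch-style sketch of the SALTTS-parallel idea as described in the abstract: a second encoder stack that consumes the length-regulated outputs and is trained to reconstruct SSL features via an auxiliary loss. All specifics here (Transformer layers, dimensions, the choice of L1 loss, and every name) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSLReconstructionBranch(nn.Module):
    """Hypothetical second encoder for SALTTS-parallel: reconstructs SSL
    features from FastSpeech2's length-regulated encoder outputs."""

    def __init__(self, d_model=256, ssl_dim=768, n_layers=4, n_heads=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_model, ssl_dim)

    def forward(self, length_regulated, ssl_target):
        # length_regulated: (batch, frames, d_model) from FastSpeech2's
        # length regulator; ssl_target: (batch, frames, ssl_dim) features
        # from a pretrained SSL model (e.g. wav2vec 2.0 or HuBERT).
        recon = self.proj(self.encoder(length_regulated))
        # Auxiliary reconstruction loss, added to FastSpeech2's usual
        # training objectives (L1 here is an assumption).
        aux_loss = F.l1_loss(recon, ssl_target)
        return recon, aux_loss

# Shape-level usage with random stand-in tensors:
branch = SSLReconstructionBranch()
lr_out = torch.randn(2, 120, 256)     # stand-in length-regulated outputs
ssl_feats = torch.randn(2, 120, 768)  # stand-in SSL target features
recon, aux_loss = branch(lr_out, ssl_feats)
```

Per the abstract, SALTTS-cascade would differ by also feeding `recon` into the decoder, rather than using this branch solely for the auxiliary loss during training.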


Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages

arXiv.org Artificial Intelligence

Cross-lingual dubbing of lecture videos requires transcription of the original audio, correction and removal of disfluencies, domain term discovery, text-to-text translation into the target language, chunking of the text using the target language's rhythm, and text-to-speech synthesis, followed by isochronous lip-syncing to the original video. The task becomes challenging when the source and target languages belong to different language families, resulting in differences in generated audio duration. This is further compounded by the original speaker's rhythm, especially for extempore speech. This paper describes the challenges in regenerating English lecture videos in Indian languages semi-automatically. A prototype is developed for dubbing lectures into 9 Indian languages. A mean opinion score (MOS) is obtained for two languages, Hindi and Tamil, on two different courses. The output video is compared with the original video in terms of MOS (1-5) and lip synchronisation, with scores of 4.09 and 3.74, respectively. Human effort is also reduced by 75%.
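
To make the stage ordering concrete, here is a schematic Python orchestrator for the pipeline the abstract enumerates. The stage functions are injected as parameters because the paper's actual components (ASR, MT, TTS, lip-sync models) are not specified in this summary; every name and signature below is a hypothetical placeholder, not the authors' API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Chunk:
    text: str     # translated text for this chunk
    start: float  # start time in the source video (seconds)
    end: float    # end time in the source video (seconds)

def dub_lecture(
    video_path: str,
    target_lang: str,
    transcribe: Callable[[str], str],                        # ASR on source audio
    clean: Callable[[str], str],                             # disfluency removal
    translate: Callable[[str, str], str],                    # text-to-text MT
    chunk: Callable[[str], List[Tuple[str, float, float]]],  # target-rhythm chunking
    synthesize: Callable[[str, float], bytes],               # duration-aware TTS
    lipsync: Callable[[str, List[bytes], List[Chunk]], str], # isochronous lip-sync
) -> str:
    """Run the dubbing stages in the order the abstract describes."""
    text = clean(transcribe(video_path))       # transcription + disfluency cleanup
    translated = translate(text, target_lang)  # into the target language
    chunks = [Chunk(t, s, e) for t, s, e in chunk(translated)]
    # Synthesize each chunk to approximately match its source duration, so the
    # dubbed audio stays isochronous with the original video despite the
    # cross-language duration mismatch the abstract highlights.
    speech = [synthesize(c.text, c.end - c.start) for c in chunks]
    return lipsync(video_path, speech, chunks)  # path to the dubbed video
```

The dependency-injected design is only a convenience for the sketch: it keeps the stage ordering and the duration bookkeeping (the part the abstract emphasizes) separate from any particular choice of ASR, translation, or TTS system.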