SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, IIT Madras
R, Nithya; S, Malavika; F, Jordan; Gangwar, Arjun; J, Metilda N; Umesh, S; Sarab, Rithik; Dubey, Akhilesh Kumar; Divakaran, Govind; K, Samudra Vijaya; Gangashetty, Suryakanth V
arXiv.org Artificial Intelligence
India is home to a multitude of languages, of which 22 are recognised by the Indian Constitution as official. Building speech-based applications for the Indian population is a difficult problem owing to limited data and the number of languages and accents to accommodate. To encourage the language technology community to build speech-based applications in Indian languages, we are open-sourcing the SPRING-INX data, which has about 2000 hours of legally sourced and manually transcribed speech data for ASR system building in Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi and Tamil. This endeavor is by SPRING Lab, Indian Institute of Technology Madras, and is a part of the National Language Translation Mission (NLTM), funded by the Indian Ministry of Electronics and Information Technology (MeitY), Government of India. We describe the data collection and data cleaning process along with the data statistics in this paper.

To increase the internet content of Indian Languages in different domains

As part of the Speech Consortium of the NLTM-R&D, which is led by Indian Institute of Technology Madras (IITM), SPRING Lab of IITM has collected and is collecting legally sourced and manually transcribed speech corpora in various Indian languages such as Tamil, Hindi, Indian English, Marathi, Bengali, Malayalam, Telugu, Assamese, Kannada, Gujarati, Odia, Punjabi, Bodo and Manipuri through speech data collection agencies identified using a tendering process. The data collected has been carefully evaluated by the Speech Quality Control (SQC) team led by KL University. We are releasing the first set of valuable data amounting to 2000 hours (both audio and the corresponding manual transcriptions), which was collected, cleaned and prepared for ASR system building in 10 Indian languages such as Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, ...
Oct-24-2023
- Genre:
- Research Report (0.40)
- Industry:
- Government > Regional Government > Asia Government > India Government (1.00)
- Technology: