Sperm whales use vowels like humans, new study finds

Popular Science

Scientists decoding sperm whale clicks have found patterns that echo the building blocks of human speech. The marine mammals have a complex communication system that researchers are working to decode, and a new study has identified a fresh component of their varied vocalizations that could hint at potential language structures: sperm whales exhibit patterns similar to human vowels and diphthongs, a connected pair of vowels within a single syllable, such as the "oi" sound.


Towards a dynamical model of English vowels. Evidence from diphthongisation

Strycharczuk, Patrycja, Kirkham, Sam, Gorman, Emily, Nagamine, Takayuki

arXiv.org Artificial Intelligence

Diphthong vowels exhibit a degree of inherent dynamic change, the extent of which can vary synchronically and diachronically, such that diphthong vowels can become monophthongs and vice versa. Modelling this type of change requires defining diphthongs in opposition to monophthongs. However, formulating an explicit definition has proven elusive in acoustics and articulation, as diphthongisation is often gradient in these domains. In this study, we consider whether diphthong vowels form a coherent phonetic category from the articulatory point of view. We present articulometry and acoustic data from six speakers of Northern Anglo-English producing a full set of phonologically long vowels. We analyse several measures of diphthongisation, all of which suggest that diphthongs are not categorically distinct from long monophthongs. We account for this observation with an Articulatory Phonology/Task Dynamic model in which diphthongs and long monophthongs have a common gestural representation, comprising two articulatory targets in each case, but they differ according to gestural constriction and location of the component gestures. We argue that a two-target representation for all long vowels is independently supported by phonological weight, as well as by the nature of historical diphthongisation and present-day dynamic vowel variation in British English.
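One family of diphthongisation measures the abstract alludes to quantifies how much the formants move over the course of the vowel. A minimal sketch (not the paper's actual measure) is the Euclidean trajectory length of the (F1, F2) contour; the formant values below are illustrative, not measured data.

```python
# Trajectory length of a vowel's (F1, F2) contour as a crude
# diphthongisation index: monophthongs move little, diphthongs a lot.

def trajectory_length(f1, f2):
    """Sum of Euclidean distances between consecutive (F1, F2) samples, in Hz."""
    total = 0.0
    for i in range(1, len(f1)):
        total += ((f1[i] - f1[i - 1]) ** 2 + (f2[i] - f2[i - 1]) ** 2) ** 0.5
    return total

# A near-monophthong: formants barely move across the vowel.
mono_f1 = [300, 302, 301, 303]
mono_f2 = [2200, 2205, 2198, 2202]

# A diphthong-like /ai/ trajectory: large F1 fall, large F2 rise.
diph_f1 = [750, 650, 500, 350]
diph_f2 = [1300, 1600, 1900, 2200]

print(trajectory_length(mono_f1, mono_f2) < trajectory_length(diph_f1, diph_f2))  # True
```

Because such measures are gradient rather than bimodal, they illustrate why the study finds no categorical acoustic boundary between diphthongs and long monophthongs.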


Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support

Watcharasupat, Karn N., Wu, Chih-Wei, Orife, Iroro

arXiv.org Artificial Intelligence

Cinematic audio source separation (CASS) is a relatively new subtask of audio source separation, concerned with the separation of a mixture into the dialogue, music, and effects stems. To date, only one publicly available dataset exists for CASS, that is, the Divide and Remaster (DnR) dataset, which is currently at version 2. While DnR v2 has been an incredibly useful resource for CASS, several areas of improvement have been identified, particularly through its use in the 2023 Sound Demixing Challenge. In this work, we develop version 3 of the DnR dataset, addressing issues relating to vocal content in non-dialogue stems, loudness distributions, mastering process, and linguistic diversity. In particular, the dialogue stem of DnR v3 includes speech content from more than 30 languages from multiple families including but not limited to the Germanic, Romance, Indo-Aryan, Dravidian, Malayo-Polynesian, and Bantu families. Benchmark results using the Bandit model indicated that training on multilingual data significantly improves the model's generalizability, even to languages with low data availability. Even in languages with high data availability, the multilingual model often performs on par with or better than dedicated models trained on monolingual CASS datasets.
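A CASS mixture is formed by summing the dialogue, music, and effects stems sample-wise, typically after scaling each stem toward a target loudness. The sketch below mirrors that construction only loosely: DnR's actual pipeline uses proper loudness units (LUFS), whereas this hypothetical version substitutes a simple RMS proxy.

```python
# Toy stem mixing with RMS-based level normalisation (an assumption;
# real mastering pipelines use LUFS, not RMS).

def rms(x):
    """Root-mean-square level of a list of samples."""
    return (sum(s * s for s in x) / len(x)) ** 0.5

def scale_to_rms(x, target):
    """Scale a stem so its RMS matches `target` (no-op for silent stems)."""
    current = rms(x)
    if current == 0.0:
        return list(x)
    gain = target / current
    return [s * gain for s in x]

def mix(stems, target_rms=0.1):
    """Sum equal-length stems after normalising each to target_rms."""
    scaled = [scale_to_rms(s, target_rms) for s in stems]
    return [sum(samples) for samples in zip(*scaled)]

# Illustrative 4-sample stems standing in for dialogue, music, effects.
dialogue = [0.5, -0.5, 0.5, -0.5]
music = [0.1, 0.1, -0.1, -0.1]
effects = [0.0, 0.2, 0.0, -0.2]
mixture = mix([dialogue, music, effects])
```

Separation models are then trained to invert this summation, recovering the three stems from the mixture alone.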


IPA Transcription of Bengali Texts

Fatema, Kanij, Haider, Fazle Dawood, Turpa, Nirzona Ferdousi, Azmal, Tanveer, Ahmed, Sourav, Hasan, Navid, Rahman, Mohammad Akhlaqur, Sarkar, Biplab Kumar, Jahin, Afrar, Hassan, Md. Rezuwan, Zihad, Md Foriduzzaman, Faruque, Rubayet Sabbir, Sushmit, Asif, Imtiaz, Mashrur, Sadeque, Farig, Rahman, Syed Shahrier

arXiv.org Artificial Intelligence

The International Phonetic Alphabet (IPA) serves to systematize phonemes in language, enabling precise textual representation of pronunciation. In Bengali phonology and phonetics, ongoing scholarly deliberations persist concerning the IPA standard and core Bengali phonemes. This work examines prior research, identifies current and potential issues, and suggests a framework for a Bengali IPA standard, facilitating linguistic analysis and NLP resource creation and downstream technology development. In this work, we present a comprehensive study of Bengali IPA transcription and introduce a novel IPA transcription framework incorporating a novel dataset with DL-based benchmarks.
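To make the transcription task concrete, here is a toy, hypothetical grapheme-to-IPA lookup illustrating what a rule-based Bengali baseline might look like. The mapping covers only a few characters and ignores the context-dependent rules (e.g., inherent-vowel deletion) that real Bengali G2P, and the paper's DL-based benchmarks, must handle.

```python
# Hypothetical minimal grapheme-to-IPA table; entries are illustrative
# and not the paper's proposed standard.
IPA_MAP = {
    "ক": "k",
    "খ": "kʰ",
    "গ": "g",
    "া": "a",
    "ি": "i",
}

def naive_transcribe(text):
    """Replace each known grapheme with its IPA symbol; keep unknowns as-is."""
    return "".join(IPA_MAP.get(ch, ch) for ch in text)

print(naive_transcribe("কা"))  # ka
```

The limits of such character-by-character substitution are precisely what motivate a standardized phoneme inventory and learned transcription models.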


Refining a Deep Learning-based Formant Tracker using Linear Prediction Methods

Alku, Paavo, Kadiri, Sudarsana Reddy, Gowda, Dhananjaya

arXiv.org Artificial Intelligence

In this study, formant tracking is investigated by refining the formants tracked by an existing data-driven tracker, DeepFormants, using the formants estimated in a model-driven manner by linear prediction (LP)-based methods. As LP-based formant estimation methods, conventional covariance analysis (LP-COV) and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis are used. In the proposed refinement approach, the contours of the three lowest formants are first predicted by the data-driven DeepFormants tracker, and the predicted formants are replaced frame-wise with local spectral peaks shown by the model-driven LP-based methods. The refinement procedure can be plugged into the DeepFormants tracker with no need for any new data learning. Two refined DeepFormants trackers were compared with the original DeepFormants and with five known traditional trackers using the popular vocal tract resonance (VTR) corpus. The results indicated that the data-driven DeepFormants trackers outperformed the conventional trackers and that the best performance was obtained by refining the formants predicted by DeepFormants using QCP-FB analysis. In addition, by tracking formants using VTR speech that was corrupted by additive noise, the study showed that the refined DeepFormants trackers were more resilient to noise than the reference trackers. In general, these results suggest that LP-based model-driven approaches, which have traditionally been used in formant estimation, can be combined with a modern data-driven tracker easily with no further training to improve the tracker's performance.
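The core refinement step described above, replacing each formant predicted by the data-driven tracker with the nearest spectral peak found by an LP-based method, can be sketched in a few lines. All values below are illustrative, and the paper's actual frame-wise matching criteria may differ.

```python
# Snap each predicted formant to the closest LP spectral peak, one frame
# at a time. This needs no retraining of the neural tracker.

def refine_frame(predicted_formants, lp_peaks):
    """Replace each predicted formant (Hz) with the nearest LP peak (Hz)."""
    return [min(lp_peaks, key=lambda p: abs(p - f)) for f in predicted_formants]

# One frame: DeepFormants-style predictions for F1-F3, plus LP peak candidates.
predicted = [520.0, 1480.0, 2550.0]
peaks = [495.0, 1510.0, 2600.0, 3400.0]
print(refine_frame(predicted, peaks))  # [495.0, 1510.0, 2600.0]
```

Because the neural prediction only selects among model-driven peaks, the hybrid inherits the tracker's robustness while anchoring each formant to spectral evidence in the frame.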


OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking

Rakib, Fazle Rabbi, Dip, Souhardya Saha, Alam, Samiul, Tasnim, Nazia, Shihab, Md. Istiak Hossain, Ansary, Md. Nazmuddoha, Hossen, Syed Mobassir, Meghla, Marsia Haque, Mamun, Mamunur, Sadeque, Farig, Chowdhury, Sayma Sultana, Reasat, Tahsin, Sushmit, Asif, Humayun, Ahmed Imtiaz

arXiv.org Artificial Intelligence

Being one of the most spoken languages globally, Bengali exhibits large diversity in dialects and prosodic features, which demands that ASR frameworks be robust to distribution shifts. For example, Islamic religious sermons in Bengali are delivered with a tonality that differs significantly from regular speech. Our training dataset is collected via massive online crowdsourcing campaigns, resulting in 1177.94 hours collected and curated from 22,645 native Bengali speakers from South Asia. Our test dataset comprises 23.03 hours of speech collected and manually annotated from 17 different sources, e.g., Bengali TV dramas, audiobooks, talk shows, online classes, and Islamic sermons. OOD-Speech is jointly the largest publicly available speech dataset, as well as the first out-of-distribution ASR benchmarking dataset for Bengali.

Figure 1: t-Stochastic Neighbor Embeddings [6] of the speech data.