 glottal source waveform


Modeling and Estimation of Vocal Tract and Glottal Source Parameters Using ARMAX-LF Model

Li, Kai, Akagi, Masato, Li, Yongwei, Unoki, Masashi

arXiv.org Artificial Intelligence

Modeling and estimation of the vocal tract and glottal source parameters of vowels from raw speech is typically done using the Auto-Regressive with eXogenous input (ARX) model and the Liljencrants-Fant (LF) model with an iteration-based estimation approach. However, the all-pole autoregressive model of the vocal tract filter cannot provide the locations of anti-formants (zeros), which increases the estimation errors in certain classes of speech sounds, such as nasal, fricative, and stop consonants. In this paper, we propose the Auto-Regressive Moving Average eXogenous with LF (ARMAX-LF) model to extend the ARX-LF model to a wider variety of speech sounds, including vowels and nasalized consonants. The LF model represents the glottal source derivative as a parametrized time-domain model, and the ARMAX model represents the vocal tract as a pole-zero filter with an additional exogenous LF excitation as input. To estimate multiple parameters with fewer errors, we first utilize the powerful nonlinear fitting ability of deep neural networks (DNNs) to build a mapping from extracted glottal source derivatives or speech waveforms to the corresponding LF parameters. Then, glottal source and vocal tract parameters can be estimated with fewer estimation errors and without the iterations required by the analysis-by-synthesis strategy. Experimental results with synthesized speech using the linear source-filter model, synthesized speech using the physical model, and real speech signals showed that the proposed ARMAX-LF model with a DNN-based estimation method can estimate the parameters of both vowels and nasalized sounds with fewer errors and less estimation time.
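
To make the pole-zero idea in this abstract concrete, the sketch below simulates a generic textbook ARMAX difference equation, y[n] = -sum_i a[i] y[n-i] + sum_j b[j] u[n-j] + sum_k c[k] e[n-k], in plain Python. It is only an illustration under stated assumptions: the abstract does not give the paper's exact parameterization, the function name `armax_filter` and all coefficient values are hypothetical, and a single impulse stands in for the LF excitation.

```python
def armax_filter(a, b, c, u, e):
    """Simulate a generic ARMAX model (hypothetical helper, not the paper's code).

    a: autoregressive coefficients with a[0] == 1 (poles)
    b: coefficients applied to the exogenous input u (zeros)
    c: coefficients applied to the noise/residual term e
    u, e: equal-length input sequences
    """
    n_samples = len(u)
    y = [0.0] * n_samples
    for n in range(n_samples):
        acc = 0.0
        # Moving-average part acting on the exogenous input (zeros).
        for j, bj in enumerate(b):
            if n - j >= 0:
                acc += bj * u[n - j]
        # Moving-average part acting on the noise term.
        for k, ck in enumerate(c):
            if n - k >= 0:
                acc += ck * e[n - k]
        # Autoregressive part acting on past outputs (poles).
        for i in range(1, len(a)):
            if n - i >= 0:
                acc -= a[i] * y[n - i]
        y[n] = acc
    return y


# Toy usage: one pole at z = 0.5, one zero at z = -0.5, impulse excitation.
# These values are illustrative assumptions, not estimates from speech.
u = [1.0, 0.0, 0.0, 0.0]
e = [0.0, 0.0, 0.0, 0.0]
y = armax_filter([1.0, -0.5], [1.0, 0.5], [1.0], u, e)
print(y)  # [1.0, 1.0, 0.5, 0.25]
```

Setting `b = [1.0]` removes the zero and reduces the sketch to the all-pole ARX case that the abstract describes as insufficient for sounds with anti-formants.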


Analysis and Detection of Pathological Voice using Glottal Source Features

Kadiri, Sudarsana Reddy, Alku, Paavo

arXiv.org Artificial Intelligence

Automatic detection of voice pathology enables objective assessment and earlier intervention for the diagnosis. This study provides a systematic analysis of glottal source features and investigates their effectiveness in voice pathology detection. Glottal source features are extracted using glottal flows estimated with the quasi-closed phase (QCP) glottal inverse filtering method, using approximate glottal source signals computed with the zero frequency filtering (ZFF) method, and using acoustic voice signals directly. In addition, we propose to derive mel-frequency cepstral coefficients (MFCCs) from the glottal source waveforms computed by QCP and ZFF to effectively capture the variations in glottal source spectra of pathological voice. Experiments were carried out using two databases, the Hospital Universitario Príncipe de Asturias (HUPA) database and the Saarbrücken Voice Disorders (SVD) database. Analysis of features revealed that the glottal source contains information that discriminates between normal and pathological voice. Pathology detection experiments were carried out using a support vector machine (SVM). From the detection experiments it was observed that the performance achieved with the studied glottal source features is comparable to or better than that of conventional MFCCs and perceptual linear prediction (PLP) features. The best detection performance was achieved when the glottal source features were combined with the conventional MFCCs and PLP features, which indicates the complementary nature of the features.
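
The zero frequency filtering step named in this abstract can be sketched roughly as follows. This is a simplified plain-Python version of the commonly described ZFF recipe (difference the signal, pass it through a cascade of two 0 Hz resonators, remove the resulting trend by repeated local-mean subtraction, then take positive-going zero crossings as candidate glottal epochs). It is not the authors' implementation; the function names, the `half_width` window parameter, the number of trend-removal passes, and the impulse-train test signal are all assumptions made for illustration.

```python
def remove_local_mean(x, half_width):
    """Subtract the mean over a window of up to 2*half_width + 1 samples."""
    n = len(x)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
    out = []
    for i in range(n):
        lo = max(0, i - half_width)          # window is truncated at the edges
        hi = min(n - 1, i + half_width)
        mean = (prefix[hi + 1] - prefix[lo]) / (hi - lo + 1)
        out.append(x[i] - mean)
    return out


def zero_frequency_filter(signal, half_width, passes=3):
    """Simplified ZFF sketch (hypothetical helper, not a reference implementation)."""
    # 1. First-order difference to remove any DC offset.
    x = [signal[0]] + [signal[n] - signal[n - 1] for n in range(1, len(signal))]
    # 2. Cascade of two ideal 0 Hz resonators: y[n] = x[n] + 2*y[n-1] - y[n-2].
    for _ in range(2):
        out, y1, y2 = [], 0.0, 0.0
        for v in x:
            cur = v + 2.0 * y1 - y2
            out.append(cur)
            y2, y1 = y1, cur
        x = out
    # 3. Trend removal: subtract the local mean, repeated a few times.
    for _ in range(passes):
        x = remove_local_mean(x, half_width)
    return x


def positive_zero_crossings(x):
    """Indices where the signal goes from negative to non-negative."""
    return [n for n in range(1, len(x)) if x[n - 1] < 0.0 <= x[n]]


# Toy usage: an impulse train with an 80-sample period stands in for voiced speech.
period = 80
signal = [1.0 if n % period == 0 else 0.0 for n in range(1600)]
zff = zero_frequency_filter(signal, half_width=60)  # window of about 1.5 periods
epochs = positive_zero_crossings(zff)
print(epochs[3:8])
```

Away from the signal edges, consecutive entries of `epochs` should be spaced about one period apart for this toy input. On real speech the window length would normally be tied to an estimate of the average pitch period, and features such as MFCCs would then be computed from the resulting waveform with a separate feature extractor.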


Investigation of Self-supervised Pre-trained Models for Classification of Voice Quality from Speech and Neck Surface Accelerometer Signals

Kadiri, Sudarsana Reddy, Javanmardi, Farhad, Alku, Paavo

arXiv.org Artificial Intelligence

Prior studies in the automatic classification of voice quality have mainly studied the use of the acoustic speech signal as input. Recently, a few studies have been carried out by jointly using both speech and neck surface accelerometer (NSA) signals as inputs, and by extracting MFCCs and glottal source features. This study examines simultaneously recorded speech and NSA signals in the classification of voice quality (breathy, modal, and pressed) using features derived from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) and using an SVM as well as CNNs as classifiers. Furthermore, the effectiveness of the pre-trained models is compared in feature extraction between glottal source waveforms and raw signal waveforms for both speech and NSA inputs. Using two signal processing methods (quasi-closed phase (QCP) glottal inverse filtering and zero frequency filtering (ZFF)), glottal source waveforms are estimated from both speech and NSA signals. The study has three main goals: (1) to study whether features derived from pre-trained models improve classification accuracy compared to conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and x-vector), (2) to investigate which of the two modalities (speech vs. NSA) is more effective in the classification task with pre-trained model-based features, and (3) to evaluate whether the deep learning-based CNN classifier can enhance the classification accuracy in comparison to the SVM classifier. The results revealed that the NSA input gave better classification performance than the speech signal. Among the features, the pre-trained model-based features showed better classification accuracies than the conventional features for both speech and NSA inputs. It was also found that the HuBERT features performed better than the wav2vec2-BASE and wav2vec2-LARGE features.