AITopics | voice quality

Collaborating Authors

voice quality

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

344ef5151be171062f42f03e69663ecf-Supplemental.pdf

Neural Information Processing SystemsApr-25-2026, 10:30:40 GMT

artificial intelligence, machine learning, speech-t, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.48)

Add feedback

Towards Robust Assessment of Pathological Voices via Combined Low-Level Descriptors and Foundation Model Representations

Ariyanti, Whenty, Chen, Kuan-Yu, Siniscalchi, Sabato Marco, Wang, Hsin-Min, Tsao, Yu

arXiv.org Artificial IntelligenceDec-12-2025

Abstract-- Perceptual voice quality assessment plays a vital role in diagnosing and monitoring voice disorders. Traditional methods, such as the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and the Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS) scales, rely on expert raters and are prone to inter-rater variability, emphasizing the need for objective solutions. This study introduces the Voice Quality Assessment Network (VOQANet), a deep learning framework that employs an attention mechanism and Speech Foundation Model (SFM) embeddings to extract high-level features. To further enhance performance, we propose VO-QANet+, which integrates self-supervised SFM embeddings with low-level acoustic descriptors--namely jitter, shimmer, and harmonics-to-noise ratio (HNR). Unlike previous approaches that focus solely on vowel-based phonation (PVQD-A), our models are evaluated on both vowel-level and sentence-level speech (PVQD-S) to assess general-izability. Experimental results demonstrate that sentence-based inputs yield higher accuracy, particularly at the patient level. Overall, VOQANet consistently outperforms baseline models in terms of root mean squared error (RMSE) and Pearson correlation coefficient across CAPE-V and GRBAS dimensions, with VOQANet+ achieving even greater performance gains. Additionally, VOQANet+ maintains consistent performance under noisy conditions, suggesting enhanced robustness for real-world and telehealth applications. This work highlights the value of combining SFM embeddings with low-level features for accurate and robust pathological voice assessment. This work was supported in part by the National Science and T ech-nology Council and Academia Sinica. Whenty Ariyanti is with the Department of Computer Science and Information Engineering, National T aiwan University of Science and T echnology, T aipei 106, T aiwan, and also with the Research Center for Information T echnology Innovation, Academia Sinica, T aipei 11529, T aiwan (e-mail: d11115805@mail.ntust.edu.tw). Kuan-Y u Chen is with the Department of Computer Science and Information Engineering, National T aiwan University of Science and T echnology, T aipei 106, T aiwan (e-mail: kychen@mail.ntust.edu.tw). Sabato Marco Siniscalchi is the University of Palermo, Palermo, Italy (e-mail: sabatomarco.siniscalchi@unipa.it).

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.21356

Country: Europe > Italy > Sicily > Palermo (0.24)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Health Care Technology (0.68)
Health & Medicine > Therapeutic Area > Neurology > Parkinson's Disease (0.68)
Health & Medicine > Therapeutic Area > Musculoskeletal (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models

Lameris, Harm, Satish, Shree Harsha Bokkahalli, Gustafson, Joakim, Székely, Éva

arXiv.org Artificial IntelligenceOct-30-2025

Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion recognition, evaluating whether model behaviours are consistent across different phonation inputs. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice. Our work provides the first examination of SFM sensitivity to these particular non-lexical aspects of speech perception.

artificial intelligence, category, voice quality, (17 more...)

arXiv.org Artificial Intelligence

2510.25577

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Speech (1.00)

Add feedback

Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias

Satish, Shree Harsha Bokkahalli, Lameris, Harm, Perrotin, Olivier, Henter, Gustav Eje, Székely, Éva

arXiv.org Artificial IntelligenceSep-29-2025

Speech Continuation (SC) is the task of generating a coherent extension of a spoken prompt while preserving both semantic context and speaker identity. Because SC is constrained to a single audio stream, it offers a more direct setting for probing biases in speech foundation models than dialogue does. In this work we present the first systematic evaluation of bias in SC, investigating how gender and phonation type (breathy, creaky, end-creak) affect continuation behaviour. We evaluate three recent models: SpiritLM (base and expressive), VAE-GSLM, and SpeechGPT across speaker similarity, voice quality preservation, and text-based bias metrics. Results show that while both speaker similarity and coherence remain a challenge, textual evaluations reveal significant model and gender interactions: once coherence is sufficiently high (for VAE-GSLM), gender effects emerge on text-metrics such as agency and sentence polarity. In addition, continuations revert toward modal phonation more strongly for female prompts than for male ones, revealing a systematic voice-quality bias. These findings highlight SC as a controlled probe of socially relevant representational biases in speech foundation models, and suggest that it will become an increasingly informative diagnostic as continuation quality improves.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2509.22061

Country: Europe > France (0.15)

Genre: Research Report > New Finding (0.35)

Industry: Media (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.95)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Voice Quality Dimensions as Interpretable Primitives for Speaking Style for Atypical Speech and Affect

Narain, Jaya, Kowtha, Vasudha, Lea, Colin, Tooley, Lauren, Yee, Dianna, Mitra, Vikramjit, Huang, Zifang, Marques, Miquel Espi, Huang, Jon, Avendano, Carlos, Ren, Shirley

arXiv.org Artificial IntelligenceMay-29-2025

Perceptual voice quality dimensions describe key characteristics of atypical speech and other speech modulations. Here we develop and evaluate voice quality models for seven voice and speech dimensions (intelligibility, imprecise consonants, harsh voice, naturalness, monoloudness, monopitch, and breathiness). Probes were trained on the public Speech Accessibility (SAP) project dataset with 11,184 samples from 434 speakers, using embeddings from frozen pre-trained models as features. We found that our probes had both strong performance and strong generalization across speech elicitation categories in the SAP dataset. We further validated zero-shot performance on additional datasets, encompassing unseen languages and tasks: Italian atypical speech, English atypical speech, and affective speech. The strong zero-shot performance and the interpretability of results across an array of evaluations suggests the utility of using voice quality dimensions in speaking style-related tasks.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2505.21809

Country: North America > United States > California (0.28)

Genre: Research Report > Experimental Study (0.46)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.95)
Information Technology > Artificial Intelligence > Natural Language (0.71)

Add feedback

Disentangling segmental and prosodic factors to non-native speech comprehensibility

Quamer, Waris, Gutierrez-Osuna, Ricardo

arXiv.org Artificial IntelligenceAug-20-2024

Current accent conversion (AC) systems do not disentangle the two main sources of non-native accent: segmental and prosodic characteristics. Being able to manipulate a non-native speaker's segmental and/or prosodic channels independently is critical to quantify how these two channels contribute to speech comprehensibility and social attitudes. We present an AC system that not only decouples voice quality from accent, but also disentangles the latter into its segmental and prosodic characteristics. The system is able to generate accent conversions that combine (1) the segmental characteristics from a source utterance, (2) the voice characteristics from a target utterance, and (3) the prosody of a reference utterance. We show that vector quantization of acoustic embeddings and removal of consecutive duplicated codewords allows the system to transfer prosody and improve voice similarity. We conduct perceptual listening tests to quantify the individual contributions of segmental features and prosody on the perceived comprehensibility of non-native speech. Our results indicate that, contrary to prior research in non-native speech, segmental features have a larger impact on comprehensibility than prosody. The proposed AC system may also be used to study how segmental and prosody cues affect social attitudes towards non-native speech.

comprehensibility, prosody, utterance, (15 more...)

arXiv.org Artificial Intelligence

2408.10997

Country:

North America > United States > Texas > Brazos County > College Station (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
North America > United States > New York (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models

Netzorg, Robin, Jalal, Ajil, McNulty, Luna, Anumanchipalli, Gopala Krishna

arXiv.org Artificial IntelligenceDec-13-2023

Perceptual modification of voice is an elusive goal. While non-experts can modify an image or sentence perceptually with available tools, it is not clear how to similarly modify speech along perceptual axes. Voice conversion does make it possible to convert one voice to another, but these modifications are handled by black box models, and the specifics of what perceptual qualities to modify and how to modify them are unclear. Towards allowing greater perceptual control over voice, we introduce PerMod, a conditional latent diffusion model that takes in an input voice and a perceptual qualities vector, and produces a voice with the matching perceptual qualities. Unlike prior work, PerMod generates a new voice corresponding to specific perceptual modifications. Evaluating perceptual quality vectors with RMSE from both human and predicted labels, we demonstrate that PerMod produces voices with the desired perceptual qualities for typical voices, but performs poorly on atypical voices.

modification, perceptual quality, permod, (13 more...)

arXiv.org Artificial Intelligence

2312.08494

Country: North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice

Lin, Yi-Heng, Tseng, Wen-Hsuan, Chen, Li-Chin, Tan, Ching-Ting, Tsao, Yu

arXiv.org Artificial IntelligenceNov-27-2023

The Consensus Auditory-Perceptual Evaluation of Voice is a widely employed tool in clinical voice quality assessment that is significant for streaming communication among clinical professionals and benchmarking for the determination of further treatment. Currently, because the assessment relies on experienced clinicians, it tends to be inconsistent, and thus, difficult to standardize. To address this problem, we propose to leverage lightly weighted automatic audio parameter extraction, to increase the clinical relevance, reduce the complexity, and enhance the interpretability of voice quality assessment. The proposed method utilizes age, sex, and five audio parameters: jitter, absolute jitter, shimmer, harmonic-to-noise ratio (HNR), and zero crossing. A classical machine learning approach is employed. The result reveals that our approach performs similar to state-of-the-art (SOTA) methods, and outperforms the latent representation obtained by using popular audio pre-trained models. This approach provide insights into the feasibility of different feature extraction approaches for voice evaluation. Audio parameters such as jitter and the HNR are proven to be suitable for characterizing voice quality attributes, such as roughness and strain. Conversely, pre-trained models exhibit limitations in effectively addressing noise-related scorings. This study contributes toward more comprehensive and precise voice quality evaluations, achieved by a comprehensively exploring diverse assessment methodologies.

evaluation, extraction, pre-trained model, (13 more...)

arXiv.org Artificial Intelligence

2311.15582

Country:

Asia > Taiwan > Taiwan Province > Taipei (0.05)
North America > United States (0.04)

Genre: Research Report > New Finding (0.47)

Industry: Health & Medicine > Therapeutic Area (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

Investigation of Self-supervised Pre-trained Models for Classification of Voice Quality from Speech and Neck Surface Accelerometer Signals

Kadiri, Sudarsana Reddy, Javanmardi, Farhad, Alku, Paavo

arXiv.org Artificial IntelligenceAug-6-2023

Prior studies in the automatic classification of voice quality have mainly studied the use of the acoustic speech signal as input. Recently, a few studies have been carried out by jointly using both speech and neck surface accelerometer (NSA) signals as inputs, and by extracting MFCCs and glottal source features. This study examines simultaneously-recorded speech and NSA signals in the classification of voice quality (breathy, modal, and pressed) using features derived from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) and using a SVM as well as CNNs as classifiers. Furthermore, the effectiveness of the pre-trained models is compared in feature extraction between glottal source waveforms and raw signal waveforms for both speech and NSA inputs. Using two signal processing methods (quasi-closed phase (QCP) glottal inverse filtering and zero frequency filtering (ZFF)), glottal source waveforms are estimated from both speech and NSA signals. The study has three main goals: (1) to study whether features derived from pre-trained models improve classification accuracy compared to conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and x-vector), (2) to investigate which of the two modalities (speech vs. NSA) is more effective in the classification task with pre-trained model-based features, and (3) to evaluate whether the deep learning-based CNN classifier can enhance the classification accuracy in comparison to the SVM classifier. The results revealed that the use of the NSA input showed better classification performance compared to the speech signal. Between the features, the pre-trained model-based features showed better classification accuracies, both for speech and NSA inputs compared to the conventional features. It was also found that the HuBERT features performed better than the wav2vec2-BASE and wav2vec2-LARGE features.

artificial intelligence, glottal source waveform, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.csl.2023.101550

2308.03226

Country:

Europe > Finland (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > India (0.04)

Genre: Research Report > New Finding (0.68)

Industry:

Health & Medicine (0.93)
Government > Regional Government (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.75)

Add feedback

ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios

Wang, Yuyue, Xiao, Huan, Wu, Yihan, Song, Ruihua

arXiv.org Artificial IntelligenceMay-20-2023

Text to Speech (TTS) models can generate natural and high-quality speech, but it is not expressive enough when synthesizing speech with dramatic expressiveness, such as stand-up comedies. Considering comedians have diverse personal speech styles, including personal prosody, rhythm, and fillers, it requires real-world datasets and strong speech style modeling capabilities, which brings challenges. In this paper, we construct a new dataset and develop ComedicSpeech, a TTS system tailored for the stand-up comedy synthesis in low-resource scenarios. First, we extract prosody representation by the prosody encoder and condition it to the TTS model in a flexible way. Second, we enhance the personal rhythm modeling by a conditional duration predictor. Third, we model the personal fillers by introducing comedian-related special tokens. Experiments show that ComedicSpeech achieves better expressiveness than baselines with only ten-minute training data for each comedian. The audio samples are available at https://xh621.github.io/stand-up-comedy-demo/

artificial intelligence, comedian, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2305.122

Country:

Asia > South Korea > Incheon > Incheon (0.05)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Austria (0.04)
(4 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.74)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)

Add feedback