 intelligibility



Neural Additive Models: Interpretable Machine Learning with Neural Nets

Neural Information Processing Systems

Deep neural networks (DNNs) are powerful black-box predictors that have achieved impressive performance on a wide variety of tasks. However, their accuracy comes at the cost of intelligibility: it is usually unclear how they make their decisions. This hinders their applicability to high stakes decision-making domains such as healthcare. We propose Neural Additive Models (NAMs) which combine some of the expressivity of DNNs with the inherent intelligibility of generalized additive models. NAMs learn a linear combination of neural networks that each attend to a single input feature. These networks are trained jointly and can learn arbitrarily complex relationships between their input feature and the output. Our experiments on regression and classification datasets show that NAMs are more accurate than widely used intelligible models such as logistic regression and shallow decision trees. They perform similarly to existing state-of-the-art generalized additive models in accuracy, but are more flexible because they are based on neural nets instead of boosted trees. To demonstrate this, we show how NAMs can be used for multitask learning on synthetic data and on the COMPAS recidivism data due to their composability, and demonstrate that the differentiability of NAMs allows them to train more complex interpretable models for COVID-19.
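The additive structure described in this abstract, one small network per input feature with the outputs summed, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the single hidden layer, the ReLU activation, the unweighted sum plus bias, and the random untrained weights are all assumptions made for illustration.

```python
import random

def make_feature_net(rng, hidden=8):
    """Random (untrained) parameters for a tiny one-input, one-output ReLU network."""
    return {
        "w1": [rng.gauss(0, 1) for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.gauss(0, 1) for _ in range(hidden)],
        "b2": 0.0,
    }

def feature_contribution(net, x):
    """Contribution of a single scalar feature value x to the prediction."""
    hidden = [max(0.0, w * x + b) for w, b in zip(net["w1"], net["b1"])]
    return sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]

def nam_predict(nets, bias, row):
    """Return (prediction, per-feature contributions) for one input row.

    Assumption: the prediction is the bias plus a plain sum of the
    per-feature network outputs.
    """
    contribs = [feature_contribution(net, x) for net, x in zip(nets, row)]
    return bias + sum(contribs), contribs

rng = random.Random(0)
nets = [make_feature_net(rng) for _ in range(3)]   # one network per feature
pred, contribs = nam_predict(nets, 0.5, [0.2, -1.0, 3.0])
```

Because each network sees only its own feature, changing one input value changes only that feature's contribution, which is what makes the per-feature curves directly inspectable.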


Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches

Phukon, Bornali, Zheng, Xiuwen, Hasegawa-Johnson, Mark

arXiv.org Artificial Intelligence

Traditional ASR metrics like WER and CER fail to capture intelligibility, especially for dysarthric and dysphonic speech, where semantic alignment matters more than exact word matches. ASR systems struggle with these speech types, often producing errors like phoneme repetitions and imprecise consonants, yet the meaning remains clear to human listeners. We identify two key challenges: (1) Existing metrics do not adequately reflect intelligibility, and (2) while LLMs can refine ASR output, their effectiveness in correcting ASR transcripts of dysarthric speech remains underexplored. To address this, we propose a novel metric integrating Natural Language Inference (NLI) scores, semantic similarity, and phonetic similarity. Our ASR evaluation metric achieves a 0.890 correlation with human judgments on Speech Accessibility Project data, surpassing traditional methods and emphasizing the need to prioritize intelligibility over error-based measures.
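For context, the word error rate that this abstract argues against can be sketched as a word-level edit distance divided by the reference length. This is the standard textbook definition, not code from the paper; the function name and the example sentences are assumptions for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# A repeated word counts as one insertion even though the meaning is unchanged.
wer = word_error_rate("please call stella", "please please call stella")
```

In this hypothetical example a single repeated word gives a WER of one third, while a listener would understand the utterance without difficulty. That kind of mismatch is the motivation the abstract gives for an intelligibility-oriented metric.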


YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

Chen, Gongyu, Zhang, Xiaoyu, Weng, Zhenqiang, Zheng, Junjie, Shen, Da, Ding, Chaofan, Zhang, Wei-Qiang, Chen, Zihao

arXiv.org Artificial Intelligence

Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied and harmony-contaminated conditions, demonstrating its effectiveness for real-world SVC deployment.


Comparative Evaluation of Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2

Rackauckas, Zackary, Hirschberg, Julia

arXiv.org Artificial Intelligence

Synthesizing expressive Japanese character speech poses unique challenges due to pitch-accent sensitivity and stylistic variability. This paper empirically evaluates two open-source text-to-speech models--VITS and Style-BERT-VITS2 JP Extra (SBV2JE)--on in-domain, character-driven Japanese speech. Using three character-specific datasets, we evaluate models across naturalness (mean opinion and comparative mean opinion score), intelligibility (word error rate), and speaker consistency. SBV2JE matches human ground truth in naturalness (MOS 4.37 vs. 4.38), achieves lower WER, and shows slight preference in CMOS. Enhanced by pitch-accent controls and a WavLM-based discriminator, SBV2JE proves effective for applications like language learning and character dialogue generation, despite higher computational demands.


Real-Time Speech Enhancement via a Hybrid ViT: A Dual-Input Acoustic-Image Feature Fusion

Bahmei, Behnaz, Arzanpour, Siamak, Birmingham, Elina

arXiv.org Artificial Intelligence

Speech quality and intelligibility are significantly degraded in noisy environments. This paper presents a novel transformer-based learning framework to address the single-channel noise suppression problem for real-time applications. Although existing deep learning networks have shown remarkable improvements in handling stationary noise, their performance often diminishes in real-world environments characterized by non-stationary noise (e.g., dog barking, baby crying). The proposed dual-input acoustic-image feature fusion using a hybrid ViT framework effectively models both temporal and spectral dependencies in noisy signals. Designed for real-world audio environments, the proposed framework is computationally lightweight and suitable for implementation on embedded devices. To evaluate its effectiveness, four standard and commonly used quality measurements, namely PESQ, STOI, SegSNR, and LLR, are utilized. Experimental results obtained using the LibriSpeech dataset as the clean speech source and the UrbanSound8K and Google AudioSet datasets as the noise sources demonstrate that the proposed method significantly improves noise reduction, speech intelligibility, and perceptual quality compared to the noisy input signal, achieving performance close to the clean reference.


Condition-Invariant fMRI Decoding of Speech Intelligibility with Deep State Space Model

Sung, Ching-Chih, Suzuki, Shuntaro, Chien, Francis Pingfan, Sugiura, Komei, Tsao, Yu

arXiv.org Artificial Intelligence

Clarifying the neural basis of speech intelligibility is critical for computational neuroscience and digital speech processing. Recent neuroimaging studies have shown that intelligibility modulates cortical activity beyond simple acoustics, primarily in the superior temporal and inferior frontal gyri. However, previous studies have been largely confined to clean speech, leaving it unclear whether the brain employs condition-invariant neural codes across diverse listening environments. To address this gap, we propose a novel architecture built upon a deep state space model for decoding intelligibility from fMRI signals, specifically tailored to their high-dimensional temporal structure. We present the first attempt to decode intelligibility across acoustically distinct conditions, showing our method significantly outperforms classical approaches. Furthermore, region-wise analysis highlights contributions from auditory, frontal, and parietal regions, and cross-condition transfer indicates the presence of condition-invariant neural codes, thereby advancing understanding of abstract linguistic representations in the brain.


SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level

Tee, Hitomi Jin Ling, Wang, Chaoren, Zhang, Zijie, Wu, Zhizheng

arXiv.org Artificial Intelligence

The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose SP-MCQA (Spoken-Passage Multiple-Choice Question Answering), a novel subjective approach evaluating the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for high-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.


SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution

Donepudi, Dharma Teja

arXiv.org Artificial Intelligence

Intra-sentence multilingual speech synthesis (code-switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. Conventional TTS systems are typically monolingual and fail to produce natural, intelligible speech in mixed-language contexts. We introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution (SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched speech generation. SFMS-ALR segments input text by Unicode script, applies adaptive language identification to determine each segment's language and locale, and normalizes prosody using sentiment-aware adjustments to preserve expressive continuity across languages. The algorithm generates a unified SSML representation with appropriate or spans and synthesizes the utterance in a single TTS request. Unlike end-to-end multilingual models, SFMS-ALR requires no retraining and integrates seamlessly with existing voices from Google, Apple, Amazon, and other providers. Comparative analysis with data-driven pipelines such as Unicom and Mask LID demonstrates SFMS-ALR's flexibility, interpretability, and immediate deployability. The framework establishes a modular baseline for high-quality, engine-independent multilingual TTS and outlines evaluation strategies for intelligibility, naturalness, and user preference.
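The first step named in this abstract, segmenting input text by Unicode script, might look roughly like the sketch below. It is not the SFMS-ALR algorithm. Python's standard library does not expose the Unicode Script property, so the sketch approximates it with the first word of each character's Unicode name, and it attaches spaces, digits, punctuation, and combining marks to the current run. Both function names and both of those choices are assumptions.

```python
import unicodedata

def script_of(ch):
    """Rough script label from the Unicode character name; None for non-letters.

    Assumption: the first word of the character name (e.g. LATIN, CYRILLIC,
    DEVANAGARI) is a good-enough stand-in for the real Script property.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None

def segment_by_script(text):
    """Split text into (script, substring) runs; neutral characters join the current run."""
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or script == runs[-1][0] or runs[-1][0] is None):
            prev_script, prev_text = runs[-1]
            # A run that began with neutral characters adopts the first real script.
            runs[-1] = (prev_script if prev_script is not None else script, prev_text + ch)
        else:
            runs.append((script, ch))
    return runs

segments = segment_by_script("hello мир")
```

Each resulting run could then be passed to a language identifier and wrapped in its own markup span. Those later stages depend on details the abstract does not give, so they are left out here.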