acoustic feature
Forensic deepfake audio detection using segmental speech features
Yang, Tianle, Sun, Chengzhe, Lyu, Siwei, Rose, Phil
This study explores the potential of using acoustic features of segmental speech sounds to detect deepfake audio. These features are highly interpretable because of their close relationship with human articulatory processes and are expected to be more difficult for deepfake models to replicate. The results demonstrate that certain segmental features commonly used in forensic voice comparison (FVC) are effective in identifying deepfakes, whereas some global features provide little value. These findings underscore the need to approach audio deepfake detection using methods that are distinct from those employed in traditional FVC, and offer a new perspective on leveraging segmental features for this purpose. In addition, the present study proposes a speaker-specific framework for deepfake detection, which differs fundamentally from the speaker-independent systems that dominate current benchmarks. While speaker-independent frameworks aim at broad generalization, the speaker-specific approach offers advantages in forensic contexts where case-by-case interpretability and sensitivity to individual phonetic realization are essential.
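One way the speaker-specific framing could be realized in code is sketched below: segmental feature vectors measured from a speaker's known-genuine recordings define a reference model, and questioned audio is scored against it. The feature extraction step, the one-class SVM, and its parameters are illustrative assumptions, not the paper's actual protocol.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def fit_speaker_model(X_genuine):
    """X_genuine: (n_tokens, n_features) segmental features (e.g., vowel formant
    measurements) extracted from known-genuine recordings of one speaker."""
    return make_pipeline(StandardScaler(), OneClassSVM(kernel="rbf", nu=0.05)).fit(X_genuine)

def score_questioned(model, X_questioned):
    # Higher (less negative) scores mean the questioned tokens look consistent with
    # the speaker's genuine segmental behaviour; low scores flag possible deepfakes.
    return float(np.mean(model.decision_function(X_questioned)))
```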
Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits
Ishii, Masato, Hayakawa, Akio, Shibuya, Takashi, Mitsufuji, Yuki
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.
Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features
Talpur, Unzela, Syed, Zafi Sherhan, Syed, Muhammad Shehram Shah, Syed, Abbas Shah
Speech Emotion Recognition (SER) is a key affective computing technology that enables emotionally intelligent artificial intelligence. While SER is challenging in general, it is particularly difficult for low-resource languages such as Urdu. This study investigates Urdu SER in a cross-corpus setting, an area that has remained largely unexplored. We employ a cross-corpus evaluation framework across three different Urdu emotional speech datasets to test model generalization. Two standard domain-knowledge-based acoustic feature sets, eGeMAPS and ComParE, are used to represent speech signals as feature vectors, which are then passed to Logistic Regression and Multilayer Perceptron classifiers. Classification performance is assessed using unweighted average recall (UAR) whilst considering class-label imbalance. Results show that self-corpus validation often overestimates performance, with UAR exceeding cross-corpus evaluation by up to 13%, underscoring that cross-corpus evaluation offers a more realistic measure of model robustness. Overall, this work emphasizes the importance of cross-corpus validation for Urdu SER, and its findings contribute to advancing affective computing research for underrepresented language communities.
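A minimal sketch of this evaluation setup follows, assuming the eGeMAPS or ComParE functionals have already been extracted into per-utterance feature matrices (e.g., with the openSMILE toolkit); dataset splits, classifier settings, and function names are illustrative, not the paper's code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def uar(y_true, y_pred):
    # Unweighted average recall = macro-averaged recall over emotion classes,
    # which accounts for class-label imbalance.
    return recall_score(y_true, y_pred, average="macro")

def cross_corpus_eval(X_train, y_train, X_test, y_test):
    """Train on one Urdu corpus, evaluate on a different one (cross-corpus setting)."""
    results = {}
    for name, clf in {
        "logreg": LogisticRegression(max_iter=1000),
        "mlp": MLPClassifier(hidden_layer_sizes=(128,), max_iter=500),
    }.items():
        model = make_pipeline(StandardScaler(), clf)
        model.fit(X_train, y_train)
        results[name] = uar(y_test, model.predict(X_test))
    return results
```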
Acoustic and Machine Learning Methods for Speech-Based Suicide Risk Assessment: A Systematic Review
Marie, Ambre, Garnier, Marine, Bertin, Thomas, Machart, Laura, Dardenne, Guillaume, Quellec, Gwenolé, Berrouiguet, Sofian
Suicide remains a public health challenge, necessitating improved detection methods to facilitate timely intervention and treatment. This systematic review evaluates the role of Artificial Intelligence (AI) and Machine Learning (ML) in assessing suicide risk through acoustic analysis of speech. Following PRISMA guidelines, we analyzed 33 articles selected from PubMed, Cochrane, Scopus, and Web of Science databases. The last search was conducted in February 2025. Risk of bias was assessed using the PROBAST tool. Studies analyzing acoustic features between individuals at risk of suicide (RS) and those not at risk (NRS) were included, while studies lacking acoustic data, a suicide-related focus, or sufficient methodological details were excluded. Sample sizes varied widely and were reported in terms of participants or speech segments, depending on the study. Results were synthesized narratively based on acoustic features and classifier performance. Findings consistently showed significant acoustic feature variations between RS and NRS populations, particularly involving jitter, fundamental frequency (F0), Mel-frequency cepstral coefficients (MFCC), and power spectral density (PSD). Classifier performance varied based on algorithms, modalities, and speech elicitation methods, with multimodal approaches integrating acoustic, linguistic, and metadata features demonstrating superior performance. Among the 29 classifier-based studies, reported AUC values ranged from 0.62 to 0.985 and accuracies from 60% to 99.85%. Most datasets were imbalanced in favor of NRS, and performance metrics were rarely reported separately by group, limiting clear identification of direction of effect.
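For readers unfamiliar with the acoustic measures recurring across the reviewed studies, the sketch below shows one common way to compute F0, MFCCs, and a power spectral density estimate from a recording; jitter typically requires a dedicated tool such as Praat and is omitted, and all parameter choices here are illustrative only.

```python
import librosa
import numpy as np
from scipy.signal import welch

def acoustic_profile(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    # Fundamental frequency (F0) over voiced frames via probabilistic YIN
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # Mel-frequency cepstral coefficients
    freqs, psd = welch(y, fs=sr, nperseg=1024)            # power spectral density estimate
    return {
        "f0_mean": float(np.nanmean(f0)),
        "f0_std": float(np.nanstd(f0)),
        "mfcc_mean": mfcc.mean(axis=1),
        "psd_freqs": freqs,
        "psd": psd,
    }
```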
SAGE-LD: Towards Scalable and Generalizable End-to-End Language Diarization via Simulated Data Augmentation
Lee, Sangmin, Choi, Woongjib, Kim, Jihyun, Kang, Hong-Goo
In this paper, we present a neural spoken language diarization model that supports an unconstrained span of languages within a single framework. Our approach integrates a learnable query-based architecture grounded in multilingual awareness, with large-scale pretraining on simulated code-switching data. By jointly leveraging these two components, our method overcomes the limitations of conventional approaches in data scarcity and architecture optimization, and generalizes effectively to real-world multilingual settings across diverse environments. Experimental results demonstrate that our approach achieves state-of-the-art performance on several language diarization benchmarks, with a relative performance improvement of 23% to 52% over previous methods. We believe that this work not only advances research in language diarization but also establishes a foundational framework for code-switching speech technologies.
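The simulated code-switching augmentation could look roughly like the sketch below, where monolingual clips are spliced into one utterance with frame-level language labels; segment durations, label resolution, and the helper name `simulate_code_switch` are assumptions, not the paper's recipe.

```python
import numpy as np

def simulate_code_switch(clips, lang_ids, n_segments=4, sr=16000, frame_hop=0.02, rng=None):
    """clips: list of 1-D waveforms at a common sample rate; lang_ids: language index per clip."""
    if rng is None:
        rng = np.random.default_rng()
    audio, labels = [], []
    for _ in range(n_segments):
        i = int(rng.integers(len(clips)))
        seg_len = int(rng.uniform(1.0, 3.0) * sr)                 # one 1-3 s turn per language segment
        start = int(rng.integers(max(1, len(clips[i]) - seg_len + 1)))
        seg = clips[i][start:start + seg_len]
        audio.append(seg)
        n_frames = max(1, int(len(seg) / (frame_hop * sr)))       # frame-level language labels
        labels.append(np.full(n_frames, lang_ids[i], dtype=np.int64))
    return np.concatenate(audio), np.concatenate(labels)
```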
Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition
Shi, Jiacheng, Du, Hongfei, Hong, Y. Alicia, Gao, Ye
Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER) due to weak paralinguistic modeling and limited cross-modal reasoning. We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), a framework that introduces structured Emotion Graphs (EGs) to guide LALMs in emotion inference without fine-tuning. Each EG encodes seven acoustic features (e.g., pitch, speech rate, jitter, shimmer), textual sentiment, keywords, and cross-modal associations. Embedded into prompts, EGs provide interpretable and compositional representations that enhance LALM reasoning. Experiments across SER benchmarks show that CCoT-Emo outperforms prior SOTA and improves accuracy over zero-shot baselines.
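As a rough illustration, an emotion-graph-style prompt might be assembled as below; the feature names mirror those listed in the abstract, but the schema, the prompt wording, and the helper `build_emotion_prompt` are assumptions rather than the CCoT-Emo implementation.

```python
def build_emotion_prompt(transcript, acoustics, sentiment, keywords):
    """acoustics: dict of extracted cues, e.g. pitch_mean, pitch_range, speech_rate,
    jitter, shimmer, energy, pause_ratio (values pre-computed by any acoustic toolkit)."""
    graph = "\n".join(f"- {name}: {value}" for name, value in acoustics.items())
    return (
        "You are analyzing the emotion of a spoken utterance.\n"
        f'Transcript: "{transcript}"\n'
        f"Text sentiment: {sentiment}; keywords: {', '.join(keywords)}\n"
        "Acoustic cues:\n"
        f"{graph}\n"
        "Reason step by step about how the acoustic cues relate to the text, "
        "then answer with a single emotion label (e.g., angry, happy, sad, neutral)."
    )
```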
Toward a Realistic Encoding Model of Auditory Affective Understanding in the Brain
Pan, Guandong, Yang, Yaqian, Chen, Shi, Wang, Xin, Liu, Longzhao, Zheng, Hongwei, Tang, Shaoting
In affective neuroscience and emotion-aware AI, understanding how complex auditory stimuli drive emotion arousal dynamics remains unresolved. This study introduces a computational framework to model the brain's encoding of naturalistic auditory inputs into dynamic behavioral/neural responses across three datasets (SEED, LIRIS, self-collected BAVE). Guided by neurobiological principles of parallel auditory hierarchy, we decompose audio into multilevel auditory features (through classical algorithms and wav2vec 2.0/HuBERT) from the original and isolated human voice/background soundtrack elements, mapping them to emotion-related responses via cross-dataset analyses. Our analysis reveals that high-level semantic representations (derived from the final layer of wav2vec 2.0/HuBERT) exert a dominant role in emotion encoding, outperforming low-level acoustic features with significantly stronger mappings to behavioral annotations and dynamic neural synchrony across most brain regions ($p < 0.05$). Notably, middle layers of wav2vec 2.0/HuBERT (balancing acoustic-semantic information) surpass the final layers in emotion induction across datasets. Moreover, human voices and soundtracks show dataset-dependent emotion-evoking biases aligned with stimulus energy distribution (e.g., LIRIS favors soundtracks due to higher background energy), with neural analyses indicating voices dominate prefrontal/temporal activity while soundtracks excel in limbic regions. By integrating affective computing and neuroscience, this work uncovers hierarchical mechanisms of auditory-emotion encoding, providing a foundation for adaptive emotion-aware systems and cross-disciplinary explorations of audio-affective interactions.
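Layer-wise wav2vec 2.0 representations of the kind compared above can be obtained as in this sketch; the model checkpoint, the mean-pooling over time, and the function name are assumptions, not the paper's exact configuration.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def layerwise_features(waveform, sr=16000):
    """waveform: 1-D float array at 16 kHz; returns one pooled vector per encoder layer."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(inputs.input_values, output_hidden_states=True)
    # hidden_states is a tuple of (n_layers + 1) tensors of shape (1, frames, dim);
    # mean-pool over time so early vs. final layers can be compared in the encoding analysis.
    return torch.stack([h.mean(dim=1).squeeze(0) for h in out.hidden_states])
```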
VoiceBBQ: Investigating Effect of Content and Acoustics in Social Bias of Spoken Language Model
Choi, Junhyuk, Oh, Ro-hoon, Seol, Jihwan, Kim, Bugeun
We introduce VoiceBBQ, a spoken extension of the BBQ (Bias Benchmark for Question Answering) - a dataset that measures social bias by presenting ambiguous or disambiguated contexts followed by questions that may elicit stereotypical responses. Due to the nature of speech, social bias in Spoken Language Models (SLMs) can emerge from two distinct sources: 1) content aspect and 2) acoustic aspect. The dataset converts every BBQ context into controlled voice conditions, enabling per-axis accuracy, bias, and consistency scores that remain comparable to the original text benchmark. Using VoiceBBQ, we evaluate two SLMs - LLaMA-Omni and Qwen2-Audio - and observe architectural contrasts: LLaMA-Omni resists acoustic bias while amplifying gender and accent bias, whereas Qwen2-Audio substantially dampens these cues while preserving content fidelity. VoiceBBQ thus provides a compact, drop-in testbed for jointly diagnosing content and acoustic bias across spoken language models.
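A simplified picture of the per-axis scoring described above is sketched below; the column names and the consistency definition (identical answers across voice conditions for the same item) are assumptions, not the benchmark's official scorer.

```python
import pandas as pd

def per_axis_scores(df):
    """df columns (assumed): axis, item_id, voice_condition, prediction, gold."""
    # Accuracy per bias axis (e.g., gender, accent)
    acc = df.groupby("axis").apply(lambda g: (g["prediction"] == g["gold"]).mean())
    # Consistency: fraction of items answered identically across all voice conditions
    cons = (
        df.groupby(["axis", "item_id"])["prediction"].nunique().eq(1)
          .groupby(level="axis").mean()
    )
    return pd.DataFrame({"accuracy": acc, "consistency": cons})
```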
ClonEval: An Open Voice Cloning Benchmark
Christop, Iwona, Kuczyński, Tomasz, Kubis, Marek
We present a new benchmark for voice cloning text-to-speech models. The benchmark consists of an evaluation protocol, an open-source library for assessing the performance of voice cloning models, and an accompanying leaderboard. The paper discusses design considerations and presents a detailed description of the evaluation procedure. The usage of the software library is explained, along with the organization of the leaderboard. The evaluation results of selected open-source models are reported.
Detection of Disease on Nasal Breath Sound by New Lightweight Architecture: Using COVID-19 as An Example
She, Jiayuan, Shi, Lin, Li, Peiqi, Dong, Ziling, Li, Renxing, Li, Shengkai, Gu, Liping, Tong, Zhao, Yang, Zhuochang, Ji, Yajie, Feng, Liang, Chen, Jiangang
Background. Infectious diseases, particularly COVID-19, continue to be a significant global health issue. Although many countries have reduced or stopped large-scale testing measures, the detection of such diseases remains a priority. Objective. This study aims to develop a novel, lightweight deep neural network for efficient, accurate, and cost-effective detection of COVID-19 using nasal breathing audio data collected via smartphones. Methodology. Nasal breathing audio from 128 patients diagnosed with the Omicron variant was collected. Mel-Frequency Cepstral Coefficients (MFCCs), widely used features in speech and sound analysis, were employed for extracting important characteristics from the audio signals. Additional feature selection was performed using Random Forest (RF) and Principal Component Analysis (PCA) for dimensionality reduction. A Dense-ReLU-Dropout model was trained with K-fold cross-validation (K=3), and performance metrics like accuracy, precision, recall, and F1-score were used to evaluate the model. Results. The proposed model achieved 97% accuracy in detecting COVID-19 from nasal breathing sounds, outperforming state-of-the-art methods such as those by [23] and [13]. Our Dense-ReLU-Dropout model, using RF and PCA for feature selection, achieves high accuracy with greater computational efficiency compared to existing methods that require more complex models or larger datasets. Conclusion. The findings suggest that the proposed method holds significant potential for clinical implementation, advancing smartphone-based diagnostics in infectious diseases. The Dense-ReLU-Dropout model, combined with innovative feature processing techniques, offers a promising approach for efficient and accurate COVID-19 detection, showcasing the capabilities of mobile device-based diagnostics.
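The pipeline described above (MFCC statistics, RF-based feature selection, PCA, and a Dense-ReLU-Dropout classifier under 3-fold cross-validation) could be approximated as in the sketch below; all layer sizes, hyperparameters, and helper names are assumptions, not the authors' implementation.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

def mfcc_stats(path, sr=16000, n_mfcc=20):
    """Summarize one recording as per-coefficient MFCC means and standard deviations."""
    y, sr = librosa.load(path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])

def evaluate(X, y, k=3, n_keep=20, n_components=10):
    """X: (n_samples, n_features) MFCC statistics; y: 0 = negative, 1 = COVID-positive (assumed labels)."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    accs = []
    for tr, te in skf.split(X, y):
        # RF-based feature selection: keep the most important MFCC statistics
        rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[tr], y[tr])
        top = np.argsort(rf.feature_importances_)[-n_keep:]
        pca = PCA(n_components=n_components).fit(X[tr][:, top])
        Xtr, Xte = pca.transform(X[tr][:, top]), pca.transform(X[te][:, top])
        # Dense-ReLU-Dropout classifier
        model = Sequential([
            Dense(64, activation="relu", input_shape=(n_components,)),
            Dropout(0.3),
            Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(Xtr, y[tr], epochs=30, batch_size=16, verbose=0)
        accs.append(model.evaluate(Xte, y[te], verbose=0)[1])
    return float(np.mean(accs))
```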