reverberation
Is Phase Really Needed for Weakly-Supervised Dereverberation?
Rodrigues, Marius, Bahrman, Louis, Badeau, Roland, Richard, Gaël
In unsupervised or weakly-supervised approaches for speech dereverberation, the target clean (dry) signals are considered to be unknown during training. In that context, evaluating to what extent information can be retrieved from the sole knowledge of reverberant (wet) speech becomes critical. This work investigates the role of the reverberant (wet) phase in the time-frequency domain. Based on Statistical Wave Field Theory, we show that late reverberation perturbs phase components with white, uniformly distributed noise, except at low frequencies. Consequently, the wet phase carries limited useful information and is not essential for weakly supervised dereverberation. To validate this finding, we train dereverberation models under a recent weak supervision framework and demonstrate that performance can be significantly improved by excluding the reverberant phase from the loss function.
- North America > United States > Maine (0.04)
- Europe > France > Grand Est > Bas-Rhin > Strasbourg (0.04)
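The abstract above argues that the wet phase carries little useful information for weakly supervised training, so it can be dropped from the loss. As a hedged illustration of that distinction (a minimal sketch with a simple L1 spectral loss, not the authors' framework), the snippet below contrasts a complex-spectrum loss, which depends on the wet phase, with a magnitude-only loss that discards it.

```python
# Minimal sketch: complex-spectrum loss (phase-sensitive) vs. magnitude-only loss.
# Illustrative only; not the paper's weak-supervision objective.
import numpy as np
from scipy.signal import stft

def spectrum(x, fs=16000, nperseg=512):
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    return Z

def complex_loss(estimate, wet, fs=16000):
    # L1 distance between complex STFTs: sensitive to the (noisy) wet phase.
    return np.mean(np.abs(spectrum(estimate, fs) - spectrum(wet, fs)))

def magnitude_loss(estimate, wet, fs=16000):
    # L1 distance between magnitude STFTs: the wet phase is excluded.
    return np.mean(np.abs(np.abs(spectrum(estimate, fs)) - np.abs(spectrum(wet, fs))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wet = rng.standard_normal(16000)
    estimate = wet + 0.01 * rng.standard_normal(16000)
    print(complex_loss(estimate, wet), magnitude_loss(estimate, wet))
```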
EchoMark: Perceptual Acoustic Environment Transfer with Watermark-Embedded Room Impulse Response
Huang, Chenpei, Yao, Lingfeng, Lee, Kyu In, Zhang, Lan Emily, Chen, Xun, Pan, Miao
Acoustic Environment Matching (AEM) is the task of transferring clean audio into a target acoustic environment, enabling engaging applications such as audio dubbing and auditory immersive virtual reality (VR). Recovering a similar room impulse response (RIR) directly from reverberant speech offers a more accessible and flexible AEM solution. However, this capability also introduces a vulnerability to arbitrary "relocation" if misused by malicious users, such as facilitating advanced voice spoofing attacks or undermining the authenticity of recorded evidence. To address this issue, we propose EchoMark, the first deep learning-based AEM framework that generates perceptually similar RIRs with an embedded watermark. Our design tackles the challenges posed by variable RIR characteristics, such as different durations and energy decays, by operating in the latent domain. By jointly optimizing the model with a perceptual loss for RIR reconstruction and a loss for watermark detection, EchoMark achieves both high-quality environment transfer and reliable watermark recovery. Experiments on diverse datasets validate that EchoMark achieves room acoustic parameter matching performance comparable to FiNS, the state-of-the-art RIR estimator. Furthermore, a high Mean Opinion Score (MOS) of 4.22 out of 5, watermark detection accuracy exceeding 99%, and bit error rates (BER) below 0.3% collectively demonstrate the effectiveness of EchoMark in preserving perceptual quality while ensuring reliable watermark embedding.
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Human Computer Interaction > Interfaces > Virtual Reality (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
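EchoMark is scored on room acoustic parameter matching. As a hedged illustration of one such parameter (not EchoMark's implementation), the sketch below estimates the reverberation time T60 from an RIR via Schroeder backward integration and a T20-style line fit.

```python
# Minimal sketch: T60 estimation from a room impulse response (assumption: single-channel RIR).
import numpy as np

def t60_from_rir(rir, fs, lo_db=-5.0, hi_db=-25.0):
    # Energy decay curve (EDC) via Schroeder backward integration of the squared RIR.
    edc = np.cumsum(rir[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    # Fit a line over the -5 dB .. -25 dB range and extrapolate the decay to -60 dB.
    idx = np.where((edc_db <= lo_db) & (edc_db >= hi_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)
    return -60.0 / slope

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    # Synthetic RIR: exponentially decaying white noise (roughly 0.7 s T60).
    rir = np.random.default_rng(0).standard_normal(fs) * np.exp(-t / 0.1)
    print(f"Estimated T60: {t60_from_rir(rir, fs):.2f} s")
```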
R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion
Zheng, Junjie, Chen, Gongyu, Ding, Chaofan, Chen, Zihao
In real-world singing voice conversion (SVC) applications, environmental noise and the demand for expressive output pose significant challenges. Conventional methods, however, are typically designed without accounting for real deployment scenarios, as both training and inference usually rely on clean data. This mismatch hinders practical use, given the inevitable presence of diverse noise sources and artifacts from music separation. To tackle these issues, we propose R2-SVC, a robust and expressive SVC framework. First, we introduce simulation-based robustness enhancement through random fundamental frequency ($F_0$) perturbations and music separation artifact simulations (e.g., reverberation, echo), substantially improving performance under noisy conditions. Second, we enrich speaker representation using domain-specific singing data: alongside clean vocals, we incorporate DNSMOS-filtered separated vocals and public singing corpora, enabling the model to preserve speaker timbre while capturing singing style nuances. Third, we integrate the Neural Source-Filter (NSF) model to explicitly represent harmonic and noise components, enhancing the naturalness and controllability of converted singing. R2-SVC achieves state-of-the-art results on multiple SVC benchmarks under both clean and noisy conditions.
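A minimal sketch of the kind of simulation-based augmentation described above, assuming waveform-level echo simulation and multiplicative jitter of an extracted F0 contour; the function names and parameter ranges are illustrative and not R2-SVC's.

```python
# Illustrative augmentation sketch: random F0 perturbation and a simulated separation echo artifact.
import numpy as np

def perturb_f0(f0_contour, rng, max_ratio=0.05):
    # Multiplicative jitter of an F0 contour in Hz; unvoiced frames (0) are left untouched.
    jitter = 1.0 + rng.uniform(-max_ratio, max_ratio, size=f0_contour.shape)
    return np.where(f0_contour > 0, f0_contour * jitter, 0.0)

def add_echo(wav, fs, rng, max_delay_s=0.12, max_gain=0.3):
    # One delayed, attenuated copy, mimicking residual echo left by music separation.
    delay = int(rng.uniform(0.02, max_delay_s) * fs)
    gain = rng.uniform(0.05, max_gain)
    out = wav.astype(float).copy()
    out[delay:] += gain * wav[:-delay]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wav = rng.standard_normal(48000)
    f0 = np.full(200, 220.0)
    noisy_wav, jittered_f0 = add_echo(wav, 24000, rng), perturb_f0(f0, rng)
```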
Sci-Phi: A Large Language Model Spatial Audio Descriptor
Jiang, Xilin, Gamper, Hannes, Braun, Sebastian
Acoustic scene perception involves describing the type of sounds, their timing, their direction and distance, as well as their loudness and reverberation. While audio language models excel in sound recognition, single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Learning from over 4,000 hours of synthetic first-order Ambisonics recordings including metadata, Sci-Phi enumerates and describes up to four directional sound sources in one pass, alongside non-directional background sounds and room characteristics. We evaluate the model with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and analyze its robustness across source counts, signal-to-noise ratios, reverberation levels, and challenging mixtures of acoustically, spatially, or temporally similar sources. Notably, Sci-Phi generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment. Demo: https://sci-phi-audio.github.io/demo
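The evaluation above is permutation-invariant: predicted source descriptions must be matched to references before scoring. A small sketch of that matching step under a stand-in cost function (not Sci-Phi's 15-metric protocol):

```python
# Minimal sketch of permutation-invariant matching between predicted and reference sources.
from itertools import permutations

def best_permutation(pred, ref, cost):
    # pred, ref: equal-length lists of source descriptors; cost: pairwise distance function.
    best, best_total = None, float("inf")
    for perm in permutations(range(len(ref))):
        total = sum(cost(pred[i], ref[j]) for i, j in enumerate(perm))
        if total < best_total:
            best_total, best = total, perm
    return best, best_total

if __name__ == "__main__":
    ref = [{"azimuth": 30}, {"azimuth": -90}]
    pred = [{"azimuth": -85}, {"azimuth": 25}]
    cost = lambda p, r: abs(p["azimuth"] - r["azimuth"])
    print(best_permutation(pred, ref, cost))  # pairs pred[0] with ref[1] and pred[1] with ref[0]
```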
LOTUSDIS: A Thai far-field meeting corpus for robust conversational ASR
Tipaksorn, Pattara, Thatphithakkul, Sumonmas, Chunwijitra, Vataya, Thangthai, Kwanchiva
We present LOTUSDIS, a publicly available Thai meeting corpus designed to advance far-field conversational ASR. The dataset comprises 114 hours of spontaneous, unscripted dialogue collected in 15-20 minute sessions with three participants, where overlapping speech is frequent and natural. Speech was recorded simultaneously by nine independent single-channel devices spanning six microphone types at distances from 0.12 m to 10 m, preserving the authentic effects of reverberation, noise, and device coloration without relying on microphone arrays. We provide standard train, dev, test splits and release a reproducible baseline system. We benchmarked several Whisper variants under zero-shot and fine-tuned conditions. Off-the-shelf models showed strong degradation with distance, confirming a mismatch between pre-training data and Thai far-field speech. Fine-tuning on LOTUSDIS dramatically improved robustness: a Thai Whisper baseline reduced overall WER from 64.3 to 38.3 and far-field WER from 81.6 to 49.5, with especially large gains on the most distant microphones. These results underscore the importance of distance-diverse training data for robust ASR. The corpus is available under CC-BY-SA 4.0. We also release training and evaluation scripts as a baseline system to promote reproducible research in this field.
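The WER figures above are the standard Levenshtein-alignment word error rate. A minimal sketch, assuming space-separated tokens (Thai evaluation pipelines may instead apply a dedicated tokenizer first):

```python
# Minimal sketch: word error rate via edit-distance alignment of word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words = 0.33
```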
U-DREAM: Unsupervised Dereverberation guided by a Reverberation Model
Bahrman, Louis, Fontaine, Mathieu, Richard, Gaël
This paper explores the outcome of training state-of-the-art dereverberation models with supervision settings ranging from weakly supervised to fully unsupervised, relying solely on reverberant signals and an acoustic model for training. Most existing deep learning approaches require paired dry and reverberant data, which are difficult to obtain in practice. We instead develop a sequential learning strategy motivated by a Bayesian formulation of the dereverberation problem, wherein acoustic parameters and dry signals are estimated from reverberant inputs using deep neural networks, guided by a reverberation matching loss.

Acoustic wave propagation in enclosed environments is significantly influenced by reflections and diffractions from surrounding surfaces and objects. These interactions alter the original waveform and result in reverberation, which can be modeled as a superposition of delayed and attenuated versions of the source signal. Reverberation has long been recognized as a critical factor affecting speech intelligibility [1], and its detrimental effects on audio clarity have motivated decades of research. The task of reverberation suppression, commonly referred to as dereverberation, has received renewed attention in recent years due to its relevance in a wide range of audio processing applications. Effective dereverberation is essential in enhancing the performance of hearing aids [2], improving communication quality in hands-free telephony [3], and enabling robust Automatic Speech Recognition (ASR) in human-machine interaction scenarios [4]. It also serves as a key preprocessing step in general-purpose speech enhancement frameworks [5]. Beyond suppression, reverberation itself plays a constructive role in audio production, particularly in simulating desired acoustic characteristics in post-processing. Reverberation conversion, or acoustic transfer, aims to transform a given recording, possibly containing unknown or undesired room effects, into a version consistent with a target acoustic environment.
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- North America > United States > Maine (0.04)
- Europe > France (0.04)
- Europe > Denmark > North Jutland > Aalborg (0.04)
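A minimal sketch, under simplified assumptions, of the two ingredients named in the U-DREAM abstract: a parametric late-reverberation model (exponentially decaying white noise) and a reverberation-matching style objective that re-reverberates the current dry estimate and compares it with the observed wet signal. This is an illustration, not the training code.

```python
# Minimal sketch: parametric late-reverberation model and a reverberation-matching objective.
import numpy as np

def exp_decay_rir(fs, length_s=0.5, t60=0.4, seed=0):
    # Late reverberation modelled as white noise under an exponential amplitude envelope;
    # the envelope reaches -60 dB (in energy) after t60 seconds.
    t = np.arange(int(length_s * fs)) / fs
    envelope = 10.0 ** (-3.0 * t / t60)
    return np.random.default_rng(seed).standard_normal(t.shape) * envelope

def reverb_matching_loss(dry_estimate, wet, rir):
    # Re-reverberate the dry estimate with the (estimated) RIR and compare to the wet signal.
    resynth = np.convolve(dry_estimate, rir)[: len(wet)]
    return np.mean((resynth - wet) ** 2)

if __name__ == "__main__":
    fs = 16000
    dry = np.random.default_rng(1).standard_normal(fs)
    rir = exp_decay_rir(fs)
    wet = np.convolve(dry, rir)[: len(dry)]
    print(reverb_matching_loss(dry, wet, rir))  # near zero when estimate and RIR are correct
```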
ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams
Grabovski, Freddie, Gressel, Gilad, Mirsky, Yisroel
Large Language Models (LLMs), combined with Text-to-Speech (TTS) and Automatic Speech Recognition (ASR), are increasingly used to automate voice phishing (vishing) scams. These systems are scalable and convincing, posing a significant security threat. We identify the ASR transcription step as the most vulnerable link in the scam pipeline and introduce ASRJam, a proactive defence framework that injects adversarial perturbations into the victim's audio to disrupt the attacker's ASR. This breaks the scam's feedback loop without affecting human callers, who can still understand the conversation. While prior adversarial audio techniques are often unpleasant and impractical for real-time use, we also propose EchoGuard, a novel jammer that leverages natural distortions, such as reverberation and echo, that are disruptive to ASR but tolerable to humans. To evaluate EchoGuard's effectiveness and usability, we conducted a 39-person user study comparing it with three state-of-the-art attacks. Results show that EchoGuard achieved the highest overall utility, offering the best combination of ASR disruption and human listening experience.

Large Language Models (LLMs) are now widely used across many applications, demonstrating impressive progress in understanding and generating natural language [1], [2], [3]. When combined with text-to-speech (TTS) and automatic speech recognition (ASR) technologies, LLMs enable powerful new capabilities such as automated customer service, outbound sales, cold calling, and advanced virtual assistants. However, as these systems become more realistic and lifelike, they also raise significant security concerns. LLMs have proven effective at generating phishing content that rivals human-written emails [4], [5], contributing to a 703% rise in credential phishing in 2024. The integration of LLMs with speech synthesis into real-time, automated scam agents is the inevitable conclusion [6]. Voice agents operate by chaining together a sequence of neural networks to handle calls in real time: (1) ASR transcribes the victim's speech into text, (2) an LLM generates an appropriate textual response, and (3) TTS synthesizes that response into natural-sounding audio. This pipeline enables scalable voice interactions that can convincingly impersonate trusted entities and extract sensitive information from victims, as seen in Figure 1.
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
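A minimal sketch of a "natural distortion" perturbation in the spirit of EchoGuard, here a simple recirculating echo; the delay, feedback, and mixing values are illustrative assumptions, not the published jammer.

```python
# Minimal sketch: a feedback echo as an ASR-disruptive but human-tolerable perturbation.
import numpy as np

def feedback_echo(wav, fs, delay_s=0.08, feedback=0.45, mix=0.6):
    # Recirculating delay line: each sample receives an attenuated copy of the output
    # from delay_s seconds earlier, then the result is mixed with the clean signal.
    delay = int(delay_s * fs)
    out = wav.astype(float)
    for n in range(delay, len(out)):
        out[n] += feedback * out[n - delay]
    return (1.0 - mix) * wav + mix * out

if __name__ == "__main__":
    fs = 16000
    speech = np.random.default_rng(0).standard_normal(fs)  # stand-in for victim audio
    jammed = feedback_echo(speech, fs)
```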
Teaching Physical Awareness to LLMs through Sounds
Wang, Weiguo, Nie, Andy, Zhou, Wenrui, Kai, Yi, Hu, Chengchen
Large Language Models (LLMs) have shown remarkable capabilities in text and multimodal processing, yet they fundamentally lack physical awareness, that is, an understanding of real-world physical phenomena. In this work, we present ACORN, a framework that teaches LLMs physical awareness through sound, focusing on fundamental physical phenomena like the Doppler effect, multipath effects, and spatial relationships. To overcome data scarcity, ACORN introduces a physics-based simulator combining real-world sound sources with controlled physical channels to generate diverse training data. Using this simulator, we build AQA-PHY, a comprehensive Audio Question-Answer dataset, and propose an audio encoder that processes both magnitude and phase information. By connecting our audio encoder to state-of-the-art LLMs, we demonstrate reasonable results in both simulated and real-world tasks, such as line-of-sight detection, Doppler effect estimation, and Direction-of-Arrival estimation, paving the way for LLMs to understand the physical world.
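One physical channel such a simulator could model is the Doppler shift of a moving source. A hedged sketch, assuming a constant approach speed and the narrowband relation f' = f * c / (c - v); this is not the ACORN simulator.

```python
# Minimal sketch: Doppler shift of an approaching source via time-axis resampling.
import numpy as np

def doppler_shift(source, fs, speed_mps=20.0, c=343.0):
    # Reading the source on a compressed time axis scales every frequency by c / (c - v),
    # the classic Doppler factor for a source moving toward a static listener.
    n = np.arange(len(source)) / fs
    tau = n * (c / (c - speed_mps))
    return np.interp(tau, n, source)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 440.0 * t)
    shifted = doppler_shift(tone, fs)  # perceived pitch around 440 * 343 / 323 ~ 467 Hz
```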
TS-URGENet: A Three-stage Universal Robust and Generalizable Speech Enhancement Network
Rong, Xiaobin, Wang, Dahan, Hu, Qinwen, Wang, Yushi, Hu, Yuxiang, Lu, Jing
Universal speech enhancement aims to handle input speech with different distortions and input formats. To tackle this challenge, we present TS-URGENet, a Three-Stage Universal, Robust, and Generalizable speech Enhancement Network. To address various distortions, the proposed system employs a novel three-stage architecture consisting of a filling stage, a separation stage, and a restoration stage. The filling stage mitigates packet loss by preliminarily filling lost regions under noise interference, ensuring signal continuity. The separation stage suppresses noise, reverberation, and clipping distortion to improve speech clarity. Finally, the restoration stage compensates for bandwidth limitation, codec artifacts, and residual packet loss distortion, refining the overall speech quality. Our proposed TS-URGENet achieved outstanding performance in the Interspeech 2025 URGENT Challenge, ranking 2nd in Track 1.
- North America > United States > Rhode Island (0.04)
- North America > Canada (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- (6 more...)
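A minimal sketch of the idea behind the filling stage, assuming lost packets appear as zeroed samples with a known mask; real systems use learned inpainting, so the linear interpolation here is only a placeholder that keeps the signal continuous for later stages.

```python
# Minimal sketch: preliminary packet-loss filling so downstream stages see a continuous signal.
import numpy as np

def fill_lost_packets(wav, lost_mask):
    # lost_mask: boolean array, True where samples were lost (e.g. zeroed packets).
    filled = wav.copy()
    idx = np.arange(len(wav))
    filled[lost_mask] = np.interp(idx[lost_mask], idx[~lost_mask], wav[~lost_mask])
    return filled

if __name__ == "__main__":
    wav = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1000))
    mask = np.zeros(1000, dtype=bool)
    mask[300:340] = True          # simulate one lost packet
    wav[mask] = 0.0
    recovered = fill_lost_packets(wav, mask)
```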
Source Separation of Small Classical Ensembles: Challenges and Opportunities
Roa-Dabike, Gerardo, Cox, Trevor J., Barker, Jon P., Akeroyd, Michael A., Bannister, Scott, Fazenda, Bruno, Firth, Jennifer, Graetzer, Simone, Greasley, Alinka, Vos, Rebecca R., Whitmer, William M.
Music source separation (MSS) of Western popular music using non-causal deep learning can be very effective. In contrast, MSS for classical music is an unsolved problem. Classical ensembles are harder to separate than popular music because of issues such as the inherent greater variation in the music, the sparsity of recordings with ground truth for supervised training, and greater ambiguity between instruments. The Cadenza project has been exploring MSS for classical music, so that music can be remixed to improve listening experiences for people with hearing loss. To enable the work, a new database of synthesized woodwind ensembles was created to overcome instrumental imbalances in the EnsembleSet. For the MSS, a set of ConvTasNet models was used, with each model being trained to extract a string or woodwind instrument. ConvTasNet was chosen because it enabled both causal and non-causal approaches to be tested. Non-causal approaches have dominated MSS work and are useful for recorded music, but for live music or processing on hearing aids, causal signal processing is needed. The MSS performance was evaluated on two small datasets (Bach10 and URMP) of real instrument recordings where the ground truth is available. The performances of the causal and non-causal systems were similar. Comparing the average Signal-to-Distortion Ratio (SDR) of the synthesized validation set (6.2 dB causal; 6.9 dB non-causal) to the real recorded evaluation set (0.3 dB causal; 0.4 dB non-causal) shows that the mismatch between synthesized and recorded data is a problem. Future work needs to either gather more real recordings that can be used for training, or improve the realism and diversity of the synthesized recordings to reduce the mismatch...
- Europe > United Kingdom > England > Nottinghamshire > Nottingham (0.04)
- Europe > Italy > Lombardy > Milan (0.04)
- Asia > India > Karnataka > Bengaluru (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Health & Medicine > Therapeutic Area > Otolaryngology (0.36)
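The SDR figures above compare separated stems against their references. A minimal sketch of the plain Signal-to-Distortion Ratio; benchmark toolkits typically report windowed (BSS Eval) or scale-invariant variants, so treat this as illustrative only.

```python
# Minimal sketch: plain SDR in dB between a reference stem and its estimate.
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    num = np.sum(reference ** 2)                  # energy of the reference
    den = np.sum((reference - estimate) ** 2)     # energy of the residual distortion
    return 10.0 * np.log10(num / (den + eps) + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(16000)
    est = ref + 0.3 * rng.standard_normal(16000)
    print(f"SDR: {sdr(ref, est):.1f} dB")
```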