target speaker
- North America > United States (0.04)
- North America > Canada > Quebec > Montreal (0.04)
ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction
Wu, Wenxuan, Wang, Shuai, Wu, Xixin, Meng, Helen, Li, Haizhou
Audio-visual target speaker extraction (AV-TSE) models primarily rely on visual cues from the target speaker. However, humans also leverage linguistic knowledge, such as syntactic constraints, next-word prediction, and prior knowledge of the conversation, to extract target speech. Inspired by this observation, we propose ELEGANCE, a novel framework that incorporates linguistic knowledge from large language models (LLMs) into AV-TSE models through three distinct guidance strategies: output linguistic constraints, intermediate linguistic prediction, and input linguistic prior. Comprehensive experiments with RoBERTa, Qwen3-0.6B, and Qwen3-4B on two AV-TSE backbones demonstrate the effectiveness of our approach. Significant improvements are observed in challenging scenarios, including impaired visual cues, unseen languages, target speaker switches, increased numbers of interfering speakers, and an out-of-domain test set. Demo page: https://alexwxwu.github.io/ELEGANCE/.
Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM
Ishikura, Seiya, Yamada, Hiroaki, Hiraoka, Tatsuya, Tokunaga, Takenobu
This study proposes augmenting dialog data with think-aloud utterances (TAUs) for modeling individual personalities in text chat by LLMs. A TAU is a verbalization of a speaker's thought before they articulate an utterance. We expect "persona LLMs" trained with TAU-augmented data to mimic the speaker's personality traits better. We tested whether the trained persona LLMs capture human personality with respect to the Big Five, a framework characterizing human personality along five traits. The results showed that LLMs trained with TAU-augmented data align more closely with the speakers' Agreeableness and Neuroticism than those trained with the original dialog data. We also found that the quality of the TAU augmentation impacts the persona LLMs' performance.
- Europe > Austria > Vienna (0.14)
- North America > United States > Pennsylvania (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- (8 more...)
DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech
Cho, Deok-Hyeon, Oh, Hyung-Seok, Kim, Seung-Bin, Lee, Seong-Whan
Cross-speaker emotion transfer in speech synthesis relies on extracting speaker-independent emotion embeddings for accurate emotion modeling without retaining speaker traits. However, existing timbre compression methods fail to fully separate speaker and emotion characteristics, causing speaker leakage and degraded synthesis quality. To address this, we propose DiEmo-TTS, a self-supervised distillation method to minimize emotional information loss and preserve speaker identity. We introduce cluster-driven sampling and information perturbation to preserve emotion while removing irrelevant factors. To facilitate this process, we propose an emotion clustering and matching approach using emotional attribute prediction and speaker embeddings, enabling generalization to unlabeled data. Additionally, we designed a dual conditioning transformer to integrate style features better. Experimental results confirm the effectiveness of our method in learning speaker-irrelevant emotion embeddings.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.73)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection
Nejad, Mahsa Ghazvini, Asl, Hamed Jafarzadeh, Edraki, Amin, Sadeghi, Mohammadreza, Asgharian, Masoud, Yu, Yuanhao, Nia, Vahid Partovi
Personalized Voice Activity Detection (PVAD) systems activate only in response to a specific target speaker by incorporating speaker embeddings from enrollment utterances. Unlike existing methods that require architectural changes, such as FiLM layers, our approach employs a hypernetwork to modify the weights of a few selected layers within a standard voice activity detection (VAD) model. This enables speaker conditioning without changing the VAD architecture, allowing the same VAD model to adapt to different speakers by updating only a small subset of its layers. We propose HyWA-PVAD, a hypernetwork weight adaptation method, and evaluate it against multiple baseline conditioning techniques. Our comparison shows consistent improvements in PVAD performance. HyWA improves on current conditioning techniques in two ways: (i) it increases mean average precision, and (ii) it simplifies deployment by reusing the same core VAD architecture.
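The weight-adaptation idea can be sketched roughly as follows; the dimensions, names, and the choice of a single adapted linear layer are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, FEAT_DIM, HID_DIM = 16, 8, 8  # toy sizes, not the paper's

# Base weights of one selected VAD layer (speaker-independent).
W_base = rng.standard_normal((HID_DIM, FEAT_DIM)) * 0.1

# Hypernetwork: here just a linear map from the enrollment speaker
# embedding to a flattened weight update for the selected layer.
H = rng.standard_normal((HID_DIM * FEAT_DIM, EMB_DIM)) * 0.01

def adapted_weights(speaker_emb):
    """Speaker-conditioned weights for the selected layer."""
    delta = (H @ speaker_emb).reshape(HID_DIM, FEAT_DIM)
    return W_base + delta

def vad_layer(frame_feats, speaker_emb):
    """One adapted layer followed by a sigmoid 'target speech' score."""
    W = adapted_weights(speaker_emb)
    h = np.tanh(W @ frame_feats)
    return 1.0 / (1.0 + np.exp(-h.sum()))

emb_a = rng.standard_normal(EMB_DIM)  # hypothetical enrollment embeddings
emb_b = rng.standard_normal(EMB_DIM)
frame = rng.standard_normal(FEAT_DIM)

# Same frame, different enrolled speakers -> different scores,
# with no change to the VAD architecture itself.
print(vad_layer(frame, emb_a), vad_layer(frame, emb_b))
```

The point of the sketch is deployment simplicity: only `W_base + delta` changes per speaker, so the same VAD graph serves every enrolled user.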
Content Anonymization for Privacy in Long-form Audio
Aggazzotti, Cristina, Garg, Ashi, Cai, Zexin, Andrews, Nicholas
Voice anonymization techniques have been found to successfully obscure a speaker's acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these cases, many utterances from the same speaker are available, which poses a significantly greater privacy risk: given multiple utterances from the same speaker, an attacker can exploit an individual's vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose new content anonymization approaches that perform a contextual rewriting of the transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. We first present results in a long-form telephone conversation setting demonstrating the effectiveness of a content-based attack on voice-anonymized speech. We then show how the proposed content-based anonymization methods can mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.
- North America > United States > Maryland > Baltimore (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
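The content-based re-identification risk described in this abstract can be illustrated with a toy sketch; the transcripts below are hypothetical, and a simple bag-of-words cosine similarity stands in for a real stylometric attacker:

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words profile of a transcript."""
    return Counter(text.lower().split())

def cos_sim(c1, c2):
    """Cosine similarity between two word-count profiles."""
    num = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    den = (math.sqrt(sum(v * v for v in c1.values()))
           * math.sqrt(sum(v * v for v in c2.values())))
    return num / den if den else 0.0

def reidentify(anon_transcript, enrolled_transcripts):
    """Link an anonymized transcript to the most similar enrolled speaker."""
    probe = bow(anon_transcript)
    return max(enrolled_transcripts,
               key=lambda spk: cos_sim(probe, bow(enrolled_transcripts[spk])))

# Hypothetical data: speaker A's habitual "gonna" survives voice
# anonymization and links the probe transcript back to them.
enrolled = {
    "A": "gonna grab coffee downtown then gonna watch the game",
    "B": "perhaps we could reconvene at the plaza afterwards",
}
print(reidentify("gonna head downtown", enrolled))
```

Content anonymization as proposed in the paper would paraphrase the probe transcript before synthesis, flattening exactly the lexical signal this attacker exploits.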
Target speaker anonymization in multi-speaker recordings
Tomashenko, Natalia, Yamagishi, Junichi, Wang, Xin, Liu, Yun, Vincent, Emmanuel
Most existing speaker anonymization research has focused on single-speaker audio, leading to techniques and evaluation metrics optimized for that condition. This study addresses the significant challenge of speaker anonymization in multi-speaker conversational audio, specifically when only a single target speaker needs to be anonymized. This scenario is highly relevant in contexts such as call centers, where customer privacy necessitates anonymizing only the customer's voice in interactions with operators. Conventional anonymization methods are often not suitable for this task. Moreover, current evaluation methodology does not allow us to accurately assess privacy protection and utility in this complex multi-speaker scenario. This work aims to bridge these gaps by exploring effective strategies for targeted speaker anonymization in conversational audio, highlighting potential problems in their development, and proposing correspondingly improved evaluation methodologies.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States (0.04)
- Europe > France > Grand Est > Meurthe-et-Moselle > Nancy (0.04)
Multi-Target Backdoor Attacks Against Speaker Recognition
Fortier, Alexandrine, Joshi, Sonal, Thebaud, Thomas, Villalba, Jesús, Dehak, Najim, Cardinal, Patrick
In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. To simulate more realistic attack conditions, we vary the signal-to-noise ratio between speech and trigger, demonstrating a trade-off between stealth and effectiveness. We further extend the attack to the speaker verification task by selecting the most similar training speaker (based on cosine similarity) as a proxy target. The attack is most effective when target and enrolled speaker pairs are highly similar, reaching success rates of up to 90% in such cases. In recent years, speaker recognition systems have achieved strong performance. However, they remain susceptible to significant security risks, including malicious attacks [1]-[6]. Due to constraints in data and computational resources, many organizations rely on external providers for model development or data collection. A particularly concerning threat is backdoor attacks, which are introduced during training. The backdoor itself is a hidden mechanism the model learns during training: when a specific input pattern, known as a trigger, is present, the model consistently produces a target output, regardless of the true input.
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.73)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.62)
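The proxy-target selection step described in this abstract can be sketched as follows; the embeddings and helper names are illustrative, not the authors' implementation (real systems would use learned speaker embeddings such as x-vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pick_proxy_target(target_emb, train_embs):
    """Return (index, similarity) of the training speaker closest to the
    attacker's intended target, used as the backdoor's proxy label."""
    sims = [cosine(target_emb, e) for e in train_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

# Made-up 3-d embeddings; training speaker 1 is nearly collinear
# with the intended target, so it is chosen as the proxy.
target = [1.0, 0.0, 0.2]
train = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.2], [-1.0, 0.0, 0.0]]
idx, sim = pick_proxy_target(target, train)
print(idx, round(sim, 3))
```

This mirrors the paper's observation that the attack works best when the enrolled target and the chosen proxy are highly similar: the higher `sim` is, the closer the verification system's decision regions for the two speakers.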
WildSpoof Challenge Evaluation Plan
Wu, Yihan, Jung, Jee-weon, Shim, Hye-jin, Cheng, Xin, Wang, Xin
The WildSpoof Challenge aims to advance the use of in-the-wild data in two intertwined speech processing tasks. It consists of two parallel tracks: (1) Text-to-Speech (TTS) synthesis for generating spoofed speech, and (2) Spoofing-robust Automatic Speaker Verification (SASV) for detecting spoofed speech. While the organizers coordinate both tracks and define the data protocols, participants treat them as separate and independent tasks. The primary objectives of the challenge are: (i) to promote the use of in-the-wild data for both TTS and SASV, moving beyond conventional clean and controlled datasets and considering real-world scenarios; and (ii) to encourage interdisciplinary collaboration between the spoofing generation (TTS) and spoofing detection (SASV) communities, thereby fostering the development of more integrated, robust, and realistic systems.