speaker similarity
- Asia > South Korea > Seoul > Seoul (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (2 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.68)
SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
Wang, Kaidi, He, Yi, Guan, Wenhao, Wu, Weijie, Ding, Hongwu, Zhang, Xiong, Wu, Di, Meng, Meng, Luan, Jian, Li, Lin, Hong, Qingyang
Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audiovisual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.
PitchFlower: A flow-based neural audio codec with pitch controllability
Torres, Diego, Roebel, Axel, Obin, Nicolas
Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFi-GAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion
Zheng, Junjie, Chen, Gongyu, Ding, Chaofan, Chen, Zihao
In real-world singing voice conversion (SVC) applications, environmental noise and the demand for expressive output pose significant challenges. Conventional methods, however, are typically designed without accounting for real deployment scenarios, as both training and inference usually rely on clean data. This mismatch hinders practical use, given the inevitable presence of diverse noise sources and artifacts from music separation. To tackle these issues, we propose R2-SVC, a robust and expressive SVC framework. First, we introduce simulation-based robustness enhancement through random fundamental frequency ($F_0$) perturbations and music separation artifact simulations (e.g., reverberation, echo), substantially improving performance under noisy conditions. Second, we enrich speaker representation using domain-specific singing data: alongside clean vocals, we incorporate DNSMOS-filtered separated vocals and public singing corpora, enabling the model to preserve speaker timbre while capturing singing style nuances. Third, we integrate the Neural Source-Filter (NSF) model to explicitly represent harmonic and noise components, enhancing the naturalness and controllability of converted singing. R2-SVC achieves state-of-the-art results on multiple SVC benchmarks under both clean and noisy conditions.
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (2 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.68)
- Asia > South Korea > Seoul > Seoul (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion
Choi, Joon-Seung, Byun, Dong-Min, Oh, Hyung-Seok, Lee, Seong-Whan
Controlling singing style is crucial for achieving an expressive and natural singing voice. Among the various style factors, vibrato plays a key role in conveying emotions and enhancing musical depth. However, modeling vibrato remains challenging due to its dynamic nature, making it difficult to control in singing voice conversion. To address this, we propose VibESVC, a controllable singing voice conversion model that explicitly extracts and manipulates vibrato using discrete wavelet transform. Unlike previous methods that model vibrato implicitly, our approach decomposes the F0 contour into frequency components, enabling precise transfer. This allows vibrato control for enhanced flexibility. Experimental results show that VibE-SVC effectively transforms singing styles while preserving speaker similarity. Both subjective and objective evaluations confirm high-quality conversion.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding
Bai, Bingsong, Lu, Qihang, Yang, Wenbing, Sun, Zihan, Hou, Yueran, Jia, Peilei, Pu, Songbai, Fu, Ruibo, Gao, Yingming, Li, Ya, Gao, Jun
ABSTRACT Paralinguistic sounds, like laughter and sighs, are crucial for synthesizing more realistic and engaging speech. However, existing methods typically depend on proprietary datasets, while publicly available resources often suffer from incomplete speech, inaccurate or missing timestamps, and limited real-world relevance. To address these problems, we propose an automated framework for generating large-scale paralinguistic data and apply it to construct the Syn-ParaSpeech dataset. The dataset comprises 6 paralinguistic categories with 118.75 hours of data and precise timestamps, all derived from natural conversational speech. Our contributions lie in introducing the first automated method for constructing large-scale par-alinguistic datasets and releasing the SynParaSpeech corpus, which advances speech generation through more natural paralinguistic synthesis and enhances speech understanding by improving paralinguis-tic event detection.
- North America > United States > California (0.04)
- Asia > China > Beijing > Beijing (0.04)
Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias
Satish, Shree Harsha Bokkahalli, Lameris, Harm, Perrotin, Olivier, Henter, Gustav Eje, Székely, Éva
Speech Continuation (SC) is the task of generating a coherent extension of a spoken prompt while preserving both semantic context and speaker identity. Because SC is constrained to a single audio stream, it offers a more direct setting for probing biases in speech foundation models than dialogue does. In this work we present the first systematic evaluation of bias in SC, investigating how gender and phonation type (breathy, creaky, end-creak) affect continuation behaviour. We evaluate three recent models: SpiritLM (base and expressive), VAE-GSLM, and SpeechGPT across speaker similarity, voice quality preservation, and text-based bias metrics. Results show that while both speaker similarity and coherence remain a challenge, textual evaluations reveal significant model and gender interactions: once coherence is sufficiently high (for VAE-GSLM), gender effects emerge on text-metrics such as agency and sentence polarity. In addition, continuations revert toward modal phonation more strongly for female prompts than for male ones, revealing a systematic voice-quality bias. These findings highlight SC as a controlled probe of socially relevant representational biases in speech foundation models, and suggest that it will become an increasingly informative diagnostic as continuation quality improves.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.95)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)