style encoder
MimicParts: Part-aware Style Injection for Speech-Driven 3D Motion Generation
Liu, Lianlian, He, YongKang, Chu, Zhaojie, Xing, Xiaofen, Xu, Xiangmin
Generating stylized 3D human motion from speech signals presents substantial challenges, primarily due to the intricate and fine-grained relationships among speech signals, individual styles, and the corresponding body movements. Current style encoding approaches either oversimplify stylistic diversity or ignore regional motion style differences (e.g., upper vs. lower body), limiting motion realism. Additionally, motion style should dynamically adapt to changes in speech rhythm and emotion, but existing methods often overlook this. To address these issues, we propose MimicParts, a novel framework designed to enhance stylized motion generation based on part-aware style injection and a part-aware denoising network. It divides the body into different regions to encode localized motion styles, enabling the model to capture fine-grained regional differences. Furthermore, our part-aware attention block allows rhythm and emotion cues to guide each body region precisely, ensuring that the generated motion aligns with variations in speech rhythm and emotional state. Experimental results show that our method outperforms existing methods, producing natural and expressive 3D human motion sequences.
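To make the part-aware style injection described above concrete, here is a minimal PyTorch sketch of one plausible reading: each body region gets its own style encoder, and the resulting region style modulates that region's motion features via FiLM-style scaling and shifting. All class names, dimensions, and the two-region split are illustrative assumptions, not the MimicParts implementation.

```python
# Hypothetical sketch of part-aware style injection (not the authors' code).
import torch
import torch.nn as nn

class PartStyleEncoder(nn.Module):
    """Encodes a style vector for one body region from its motion features."""
    def __init__(self, in_dim, style_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, style_dim),
        )
    def forward(self, x):                 # x: (batch, frames, in_dim)
        return self.net(x.mean(dim=1))    # temporal pooling -> (batch, style_dim)

class PartAwareInjection(nn.Module):
    """FiLM-style modulation of per-region motion features by the region style."""
    def __init__(self, feat_dim, style_dim):
        super().__init__()
        self.to_scale = nn.Linear(style_dim, feat_dim)
        self.to_shift = nn.Linear(style_dim, feat_dim)
    def forward(self, h, s):              # h: (batch, frames, feat_dim), s: (batch, style_dim)
        return h * (1 + self.to_scale(s).unsqueeze(1)) + self.to_shift(s).unsqueeze(1)

# Usage: separate encoders/injectors for upper- and lower-body regions.
upper_feats = torch.randn(2, 60, 96)      # placeholder upper-body joint features
lower_feats = torch.randn(2, 60, 96)
enc_u, enc_l = PartStyleEncoder(96, 64), PartStyleEncoder(96, 64)
inj_u, inj_l = PartAwareInjection(96, 64), PartAwareInjection(96, 64)
styled_upper = inj_u(upper_feats, enc_u(upper_feats))
styled_lower = inj_l(lower_feats, enc_l(lower_feats))
```

In the paper's setting, a part-aware denoising network would consume these styled regional features together with rhythm and emotion cues; the sketch stops at the injection step.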
DiffStyleTS: Diffusion Model for Style Transfer in Time Series
Nagda, Mayank, Ostheimer, Phil, Arweiler, Justus, Jungjohann, Indra, Werner, Jennifer, Wagner, Dennis, Muraleedharan, Aparna, Jafari, Pouya, Schmid, Jochen, Jirasek, Fabian, Burger, Jakob, Bortz, Michael, Hasse, Hans, Mandt, Stephan, Kloft, Marius, Fellenz, Sophie
Style transfer combines the content of one signal with the style of another. It supports applications such as data augmentation and scenario simulation, helping machine learning models generalize in data-scarce domains. While well developed in vision and language, style transfer methods for time series data remain limited. We introduce DiffTSST, a diffusion-based framework that disentangles a time series into content and style representations via convolutional encoders and recombines them through a self-supervised attention-based diffusion process. At inference, encoders extract content and style from two distinct series, enabling conditional generation of novel samples to achieve style transfer. We demonstrate both qualitatively and quantitatively that DiffTSST achieves effective style transfer. We further validate its real-world utility by showing that data augmentation with DiffTSST improves anomaly detection in data-scarce regimes.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California (0.04)
- (8 more...)
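The abstract's content/style split lends itself to a small sketch: two convolutional encoders summarize a series into separate content and style vectors, and swapping the style source at inference yields the conditioning signal for generation. Encoder depths, dimensions, and the concatenation step are assumptions for illustration; the diffusion denoiser itself is omitted.

```python
# Minimal sketch of the content/style split described above; layer sizes and
# the conditioning scheme are assumptions, not the DiffTSST code.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, out_dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.proj = nn.Linear(64, out_dim)
    def forward(self, x):                 # x: (batch, 1, length)
        h = self.conv(x).mean(dim=-1)     # global average pooling over time
        return self.proj(h)               # (batch, out_dim)

content_enc, style_enc = ConvEncoder(64), ConvEncoder(16)

series_a = torch.randn(4, 1, 256)   # provides content
series_b = torch.randn(4, 1, 256)   # provides style
cond = torch.cat([content_enc(series_a), style_enc(series_b)], dim=-1)
# `cond` would then condition the diffusion denoiser to generate a series
# with A's content rendered in B's style.
```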
Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
Yang, Jianing, Li, Sheng, Shinozaki, Takahiro, Saito, Yuki, Saruwatari, Hiroshi
Current emotional Text-To-Speech (TTS) and style transfer methods rely on reference encoders to control global style or emotion vectors, but do not capture the nuanced acoustic details of the reference speech. To address this, we propose a novel emotional TTS method that enables fine-grained phoneme-level emotion embedding prediction while disentangling intrinsic attributes of the reference speech. The proposed method employs a style disentanglement strategy to guide two feature extractors, reducing the mutual information between timbre and emotion features and effectively separating distinct style components from the reference speech. Experimental results demonstrate that our method outperforms baseline TTS systems in generating natural and emotionally rich speech. This work highlights the potential of disentangled and fine-grained representations in advancing the quality and flexibility of emotional TTS systems.
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.73)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)
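As a hedged illustration of "reducing mutual information between timbre and emotion features": the abstract does not describe the paper's MI estimator, so the sketch below substitutes a simple cross-correlation penalty that pushes the two embeddings toward statistical independence in the same spirit.

```python
# Illustrative stand-in for MI-guided disentanglement: penalize the
# cross-correlation between timbre and emotion embeddings. This is an
# editor's proxy, not the estimator used in the paper.
import torch

def cross_correlation_penalty(timbre, emotion, eps=1e-8):
    # timbre: (batch, d1), emotion: (batch, d2)
    t = (timbre - timbre.mean(0)) / (timbre.std(0) + eps)
    e = (emotion - emotion.mean(0)) / (emotion.std(0) + eps)
    corr = t.T @ e / timbre.size(0)        # (d1, d2) cross-correlation matrix
    return (corr ** 2).mean()              # push all cross-correlations to zero

timbre = torch.randn(8, 64, requires_grad=True)
emotion = torch.randn(8, 32, requires_grad=True)
loss = cross_correlation_penalty(timbre, emotion)
loss.backward()   # gradients nudge the two style components apart
```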
Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
Kim, Nam-Gyu, Cho, Deok-Hyeon, Kim, Seung-Bin, Lee, Seong-Whan
Recent advances in expressive text-to-speech (TTS) have introduced diverse methods based on style embedding extracted from reference speech. However, synthesizing high-quality expressive speech remains challenging. We propose Spotlight-TTS, which exclusively emphasizes style via voiced-aware style extraction and style direction adjustment. Voiced-aware style extraction focuses on voiced regions highly related to style while maintaining continuity across different speech regions to improve expressiveness. We adjust the direction of the extracted style for optimal integration into the TTS model, which improves speech quality. Experimental results demonstrate that Spotlight-TTS achieves superior performance compared to baseline models in terms of expressiveness, overall speech quality, and style transfer capability. Our audio samples are publicly available.
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.73)
- Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
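A minimal sketch of the voiced-aware idea, assuming voicing is flagged by non-zero F0: style features are pooled only over voiced frames, and "direction adjustment" is reduced here to fixing the embedding's magnitude while keeping its direction. Both simplifications are the editor's assumptions, not the Spotlight-TTS recipe.

```python
# Hedged sketch: pool style features over voiced frames only, then keep the
# style vector's direction while normalizing its magnitude.
import torch

def voiced_aware_pool(feats, f0):
    # feats: (batch, frames, dim); f0: (batch, frames), zero on unvoiced frames
    mask = (f0 > 0).float().unsqueeze(-1)          # (batch, frames, 1)
    summed = (feats * mask).sum(dim=1)
    count = mask.sum(dim=1).clamp(min=1.0)
    return summed / count                          # mean over voiced frames only

def adjust_direction(style):
    # One simple notion of "direction adjustment": preserve the embedding's
    # direction but fix its magnitude before injecting it into the TTS model.
    return style / style.norm(dim=-1, keepdim=True)

feats = torch.randn(2, 100, 128)
f0 = torch.rand(2, 100) * (torch.rand(2, 100) > 0.4)   # zeros ~ unvoiced frames
style = adjust_direction(voiced_aware_pool(feats, f0))
```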
GSA-TTS: Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor
We present the gradual style adaptor TTS (GSA-TTS) with a novel style encoder that gradually encodes speaking styles from an acoustic reference for zero-shot speech synthesis. The local styles are then combined by self-attention to obtain a global style condition. This semantic and hierarchical encoding strategy provides a robust and rich style representation for an acoustic model. We test GSA-TTS on unseen speakers and obtain promising results regarding naturalness, speaker similarity, and intelligibility. Additionally, we explore the potential of GSA in terms of interpretability and controllability, which stems from its hierarchical structure. Audio samples are available at https://gsattsdemo.github.io/
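The self-attention combination step above admits a short sketch: given already-extracted local style tokens, one attention layer followed by mean pooling yields a single global style condition. The token count and dimensions are illustrative, not the paper's configuration.

```python
# Sketch of combining local style tokens into a global style via self-attention.
import torch
import torch.nn as nn

class GlobalStyleCombiner(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, local_styles):         # (batch, n_local, dim)
        attended, _ = self.attn(local_styles, local_styles, local_styles)
        return attended.mean(dim=1)          # (batch, dim) global style condition

local_styles = torch.randn(2, 12, 256)       # e.g. 12 local style tokens
global_style = GlobalStyleCombiner(256)(local_styles)
```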
Domain-invariant feature learning in brain MR imaging for content-based image retrieval
Tobari, Shuya, Tomoshige, Shuhei, Muraki, Hayato, Oishi, Kenichi, Iyatomi, Hitoshi
In large-scale studies that collect brain MR images from multiple facilities, differences in imaging equipment and protocols at each site cannot be ignored, and this domain gap has become a significant issue in recent years. In this study, we propose a new low-dimensional representation (LDR) acquisition method called style encoder adversarial domain adaptation (SE-ADA) to realize content-based image retrieval (CBIR) of brain MR images. SE-ADA reduces domain differences while preserving pathological features by separating domain-specific information from the LDR and minimizing domain differences using adversarial learning. In evaluation experiments comparing SE-ADA with recent domain harmonization methods on eight public brain MR datasets (ADNI1/2/3, OASIS1/2/3/4, PPMI), SE-ADA effectively removed domain information while preserving key aspects of the original brain structure and demonstrated the highest disease search accuracy.
- North America > United States > Maryland > Baltimore (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Health Care Technology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (0.71)
- Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.33)
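Since the abstract hinges on adversarial removal of domain information from the LDR, here is the standard gradient-reversal construction that such methods typically build on. SE-ADA's actual architecture and losses are not reproduced; the site count and dimensions below are placeholders.

```python
# Minimal gradient-reversal sketch of adversarial domain removal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # flip gradients w.r.t. the encoder

class DomainClassifier(nn.Module):
    def __init__(self, dim, n_domains):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_domains))
    def forward(self, z, lam=1.0):
        return self.net(GradReverse.apply(z, lam))

# Training the classifier to predict the acquisition site while the encoder
# receives reversed gradients pushes site information out of the LDR.
z = torch.randn(8, 64, requires_grad=True)   # stand-in low-dimensional representation
logits = DomainClassifier(64, 8)(z)          # e.g. 8 sites/datasets
loss = nn.functional.cross_entropy(logits, torch.randint(0, 8, (8,)))
loss.backward()
```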