Goto

Collaborating Authors

 Graham, Calbert


Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning

arXiv.org Artificial Intelligence

Voice conversion (VC) modifies voice characteristics while preserving linguistic content. This paper presents the Stepback network, a novel model for converting speaker identity using non-parallel data. Unlike traditional VC methods that rely on parallel data, our approach leverages deep learning techniques to enhance disentanglement completion and linguistic content preservation.

VAEs consist of two main parts: a content encoder and a decoder. The content encoder processes source speech, transforms it into a latent representation, and removes speaker information. The decoder takes the speaker identity, combines it with the latent representation, and reconstructs the speech [5]. A notable VAE approach is disentangling speaker and content representations using instance normalization, which ...
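To make the encoder/decoder split concrete, here is a minimal PyTorch-style sketch of this kind of disentangling VAE backbone: a content encoder that applies instance normalization to strip utterance-level (speaker-correlated) statistics, and a decoder that re-injects a target speaker embedding. This is an illustration of the general technique, not the Stepback architecture; the class names, layer sizes, and dimensions (ContentEncoder, Decoder, latent_dim, spk_dim) are all assumed for the example.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Encodes source speech and strips speaker statistics via instance norm."""
    def __init__(self, n_mels=80, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            # InstanceNorm1d normalizes each channel per utterance, removing
            # global statistics that correlate with speaker identity.
            nn.InstanceNorm1d(256),
            nn.Conv1d(256, latent_dim, kernel_size=5, padding=2),
        )

    def forward(self, mel):           # mel: (batch, n_mels, frames)
        return self.net(mel)          # content latent: (batch, latent_dim, frames)

class Decoder(nn.Module):
    """Combines the content latent with a speaker embedding to rebuild speech."""
    def __init__(self, latent_dim=128, spk_dim=64, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(latent_dim + spk_dim, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, content, spk_emb):
        # Broadcast the speaker embedding across all time frames, then decode.
        spk = spk_emb.unsqueeze(-1).expand(-1, -1, content.size(-1))
        return self.net(torch.cat([content, spk], dim=1))

# Usage: swap in a different speaker embedding at inference to convert identity.
enc, dec = ContentEncoder(), Decoder()
mel = torch.randn(1, 80, 200)         # stand-in mel-spectrogram
spk = torch.randn(1, 64)              # stand-in target speaker embedding
converted = dec(enc(mel), spk)        # (1, 80, 200)
```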


PSST! Prosodic Speech Segmentation with Transformers

arXiv.org Artificial Intelligence

Self-attention mechanisms have enabled transformers to achieve superhuman-level performance on many speech-to-text (STT) tasks, yet the challenge of automatic prosodic segmentation has remained unsolved. In this paper we fine-tune Whisper, a pretrained STT model, to annotate intonation unit (IU) boundaries by repurposing low-frequency tokens. Our approach achieves an accuracy of 95.8%, outperforming previous methods without the need for large-scale labeled data or enterprise-grade compute resources. We also degrade the input signal by applying a series of filters, finding that low-pass filtering at a 3.2 kHz cutoff improves segmentation performance in out-of-sample and out-of-distribution contexts. We release our model as both a transcription tool and a baseline for further improvements in prosodic segmentation.
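The filtering step is easy to reproduce in outline. The sketch below applies a 3.2 kHz low-pass filter to audio before transcription, the idea being to attenuate high-frequency segmental detail while keeping the pitch and energy contours that cue IU boundaries. The cutoff comes from the abstract; the specific filter design (fifth-order Butterworth, zero-phase sosfiltfilt) and the 16 kHz sample rate are assumptions for illustration, not the authors' exact pipeline.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def lowpass(audio, sr, cutoff_hz=3200.0, order=5):
    """Zero-phase Butterworth low-pass filter for a mono float signal."""
    sos = butter(order, cutoff_hz, btype="low", fs=sr, output="sos")
    return sosfiltfilt(sos, audio)

# Example: filter a (stand-in) 3-second utterance before passing it to an
# STT model such as Whisper, which expects 16 kHz input.
sr = 16_000
audio = np.random.randn(sr * 3)       # placeholder for real speech samples
filtered = lowpass(audio, sr)         # same length, energy above 3.2 kHz attenuated
```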