Goto · Collaborating Authors · Speech


Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Neural Information Processing Systems

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech without transcripts from thousands of speakers, to generate a fixed-dimensional embedding vector from only seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder network that converts the mel spectrogram into time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the multispeaker TTS task, and is able to synthesize natural speech from speakers unseen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
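To make the three-stage design concrete, here is a minimal sketch in PyTorch of the inference pipeline the abstract describes. All class names and layer choices below are hypothetical placeholders and drastically simplified (the paper's synthesizer is an attention-based sequence-to-sequence network and its vocoder is autoregressive); only the data flow, a fixed-size speaker embedding from reference audio conditioning a text-to-mel network whose output is vocoded to a waveform, mirrors the description.

```python
# Hedged sketch of the three-stage pipeline; all names are placeholders,
# not the authors' released code.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Maps a variable-length reference utterance to a fixed-dim embedding."""
    def __init__(self, n_mels=40, dim=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, dim, batch_first=True)

    def forward(self, ref_mels):                  # (B, T_ref, n_mels)
        _, (h, _) = self.rnn(ref_mels)
        e = h[-1]                                 # (B, dim)
        return e / e.norm(dim=-1, keepdim=True)   # L2-normalized embedding

class Synthesizer(nn.Module):
    """Toy stand-in for the Tacotron 2 style text-to-mel network."""
    def __init__(self, vocab=100, dim=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim * 2, n_mels)

    def forward(self, text_ids, spk_emb):         # (B, T_text), (B, dim)
        x = self.embed(text_ids)
        spk = spk_emb.unsqueeze(1).expand(-1, x.size(1), -1)
        return self.proj(torch.cat([x, spk], dim=-1))  # (B, T_text, n_mels)

class Vocoder(nn.Module):
    """Toy stand-in for the WaveNet vocoder (mel -> waveform samples)."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.upsample = nn.Linear(n_mels, hop)

    def forward(self, mels):                      # (B, T_frames, n_mels)
        return self.upsample(mels).flatten(1)     # (B, T_frames * hop)

# Inference: clone a voice from a few seconds of reference speech.
encoder, synth, vocoder = SpeakerEncoder(), Synthesizer(), Vocoder()
ref = torch.randn(1, 300, 40)                     # ~3 s of reference mel frames
text = torch.randint(0, 100, (1, 50))             # token ids for the input text
wav = vocoder(synth(text, encoder(ref)))
print(wav.shape)                                  # torch.Size([1, 12800])
```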



TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Neural Information Processing Systems

There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., pipeline frameworks that concatenate speech recognition, machine translation, and text-to-speech models. The primary challenges stem from the inherent complexities involved in direct translation tasks and the scarcity of data. In this study, we introduce TransVIP, a novel framework that leverages diverse datasets in a cascade fashion during training yet facilitates end-to-end inference through joint probability. Furthermore, we propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process, making the model highly suitable for scenarios such as video dubbing. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
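The abstract does not spell out how the joint probability is computed, so the sketch below only illustrates one plausible reading of "cascade training, end-to-end inference": intermediate text hypotheses are not committed to greedily; instead, each full source-to-text-to-speech path is scored by the sum of component log-probabilities, conditioned on separately encoded voice and isochrony information. Every function, score, and dictionary key here is a hypothetical stand-in, not the TransVIP implementation.

```python
# Hedged illustration of joint scoring across cascaded stages; all scores
# are dummies chosen only so the example runs.
def asr_mt_score(source_speech, text_hyp):
    """Hypothetical: log p(translated text | source speech)."""
    return -0.1 * len(text_hyp)                    # dummy score for illustration

def synthesis_score(text_hyp, voice_emb, iso_emb):
    """Hypothetical: log p(target speech | text, voice embedding, isochrony)."""
    # Dummy penalty when the text length mismatches the source duration.
    return -0.05 * abs(len(text_hyp) - iso_emb["target_chars"])

def joint_decode(source_speech, text_candidates, voice_emb, iso_emb):
    # End-to-end decision: pick the path maximizing the joint (summed log)
    # score rather than the best hypothesis of the first stage alone.
    scored = [
        (asr_mt_score(source_speech, t) + synthesis_score(t, voice_emb, iso_emb), t)
        for t in text_candidates
    ]
    return max(scored)

best_score, best_text = joint_decode(
    source_speech=None,                            # placeholder for a waveform
    text_candidates=["hello there", "hi there everyone"],
    voice_emb=[0.1] * 8,                           # stand-in speaker embedding
    iso_emb={"target_chars": 12},                  # stand-in duration constraint
)
print(best_text, best_score)
```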



HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models (Chen Chen, Chao-Han Huck Yang)

Neural Information Processing Systems

Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for predicting the true transcription. This approach is a paradigm shift from the traditional language model rescoring strategy, which can only select one candidate hypothesis as the output transcription.
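The contrast the abstract draws can be summarized in a few lines of Python. The prompt template and the `llm_generate` callable below are hypothetical placeholders (the benchmark defines its own prompts and models); the point is that generative correction may compose a transcription that appears in none of the N-best hypotheses, whereas rescoring can only select one of them.

```python
# Hedged sketch: rescoring vs. LLM-based generative error correction.
def rescoring_baseline(nbest, lm_score):
    """Traditional strategy: select the single best existing hypothesis."""
    return max(nbest, key=lm_score)

def generative_correction(nbest, llm_generate):
    """Generative strategy: produce the transcription from the N-best list."""
    prompt = (
        "Below are the N-best hypotheses from a speech recognizer. "
        "Report the most likely true transcription.\n"
        + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
        + "\nTranscription:"
    )
    return llm_generate(prompt)

# Toy stand-ins so the sketch runs end to end.
nbest = ["i scream for ice cream", "eye scream for ice cream", "i scream for i cream"]
picked = rescoring_baseline(nbest, lm_score=len)            # dummy LM score
corrected = generative_correction(nbest, llm_generate=lambda p: "i scream for ice cream")
print(picked, "|", corrected)
```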



Save over $100 on Sony XM4 headphones ahead of Memorial Day

Mashable

SAVE $120: As of May 23, Sony WH-1000XM4 headphones are on sale for $228 at Amazon. If you're looking for a seriously high-quality pair of headphones, you won't want to miss this great deal on the Sony XM4s. With premium noise cancellation, stellar sound quality, and Alexa voice control, these headphones are next level. And as of May 23, you can get them for less: at Amazon, they are currently on sale for $228, saving you $120 off the list price.


Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

Neural Information Processing Systems

We present an approach to encode a speech signal into a fixed-size representation which minimizes the cosine loss with the existing massively multilingual LASER text embedding space. Sentences are close in this embedding space, independently of their language and modality, either text or audio. Using a similarity metric in that multimodal embedding space, we perform mining of audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl. This yielded more than twenty thousand hours of aligned speech translations. To evaluate the automatically mined speech/text corpora, we train neural speech translation systems for several language pairs.
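The mining step reduces to nearest-neighbor search under cosine similarity in the shared embedding space. The sketch below, in NumPy, uses random encoders as stand-ins (the paper trains a speech encoder to match the LASER text space) and a simple threshold in place of whatever mining criterion the authors actually apply; it only shows the shape of the computation.

```python
# Hedged sketch of cosine-similarity mining in a shared speech/text space.
import numpy as np

rng = np.random.default_rng(0)
DIM = 1024                                        # LASER-style embedding size

def embed_text(sentences):
    """Stand-in for the massively multilingual LASER text encoder."""
    return rng.standard_normal((len(sentences), DIM))

def embed_speech(utterances):
    """Stand-in for a speech encoder trained with a cosine loss to LASER."""
    return rng.standard_normal((len(utterances), DIM))

def mine_pairs(speech_emb, text_emb, threshold=0.03):
    # Normalize so the dot product equals cosine similarity.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = s @ t.T                                 # (n_speech, n_text) similarities
    best = sim.argmax(axis=1)                     # best text match per utterance
    return [(i, j, sim[i, j]) for i, j in enumerate(best) if sim[i, j] > threshold]

pairs = mine_pairs(embed_speech(range(5)), embed_text(range(100)))
print(pairs)
```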


AI comes to Reddit's main search bar - who needs Google now?

ZDNet

It's getting a little easier to use Reddit as a search engine. Last year, Reddit rolled out a new feature called Reddit Answers. Since so many people use Reddit as a Google replacement to tap into the community's immense knowledge, the site introduced AI-curated answers related to the topic you were looking for. At the time, you had to specifically head to a Reddit Answers tab to use the feature, but now it's coming to the main search bar. In the company's first-quarter earnings call yesterday, Reddit CEO Steve Huffman said that the site was "working to integrate it [Reddit Answers] into Reddit's core search experience to further streamline the path from question to answer on Reddit."