AITopics | self-supervised speech representation

Collaborating Authors

self-supervised speech representation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

69c754f571806bf15add18556ff39b4f-Paper-Conference.pdf

shlee

Neural Information Processing SystemsFeb-9-2026, 14:16:34 GMT

hierspeech, representation, self-supervised speech representation, (12 more...)

Neural Information Processing Systems

Country: Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)

Add feedback

HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis

Neural Information Processing SystemsDec-24-2025, 09:38:05 GMT

This paper presents HierSpeech, a high-quality end-to-end text-to-speech (TTS) system based on a hierarchical conditional variational autoencoder (VAE) utilizing self-supervised speech representations. Recently, single-stage TTS systems, which directly generate raw speech waveform from text, have been getting interest thanks to their ability in generating high-quality audio within a fully end-to-end training pipeline. However, there is still a room for improvement in the conventional TTS systems. Since it is challenging to infer both the linguistic and acoustic attributes from the text directly, missing the details of attributes, specifically linguistic information, is inevitable, which results in mispronunciation and over-smoothing problem in their synthetic speech. To address the aforementioned problem, we leverage self-supervised speech representations as additional linguistic representations to bridge an information gap between text and speech.

hierarchical variational inference, representation, self-supervised representation, (9 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.55)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.41)

Add feedback

HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis Sang-Hoon Lee 1 Seung-Bin Kim 2 Ji-Hyun Lee 2

shlee

Neural Information Processing SystemsAug-15-2025, 14:05:46 GMT

HierSpeech-U can adapt to a novel speaker by utilizing self-supervised speech representations without text transcripts.

artificial intelligence, machine learning, representation, (15 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (0.46)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)

Add feedback

HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis

Neural Information Processing SystemsOct-11-2024, 11:20:17 GMT

hierarchical variational inference, representation, self-supervised speech representation, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.76)

Add feedback

Self-Supervised Speech Representations are More Phonetic than Semantic

Choi, Kwanghee, Pasad, Ankita, Nakamura, Tomohiko, Fukayama, Satoru, Livescu, Karen, Watanabe, Shinji

arXiv.org Artificial IntelligenceJun-12-2024

Self-supervised speech models (S3Ms) have become an effective backbone for speech applications. Various analyses suggest that S3Ms encode linguistic properties. In this work, we seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms. Specifically, we curate a novel dataset of near homophone (phonetically similar) and synonym (semantically similar) word pairs and measure the similarities between S3M word representation pairs. Our study reveals that S3M representations consistently and significantly exhibit more phonetic than semantic similarity. Further, we question whether widely used intent classification datasets such as Fluent Speech Commands and Snips Smartlights are adequate for measuring semantic abilities. Our simple baseline, using only the word identity, surpasses S3M-based models. This corroborates our findings and suggests that high scores on these datasets do not necessarily guarantee the presence of semantic content.

information, representation, word pair, (13 more...)

arXiv.org Artificial Intelligence

2406.08619

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada > Quebec > Montreal (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(8 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.90)

Add feedback

Low-Resource Cross-Domain Singing Voice Synthesis via Reduced Self-Supervised Speech Representations

Kakoulidis, Panos, Ellinas, Nikolaos, Vamvoukakis, Georgios, Christidou, Myrsini, Vioni, Alexandra, Maniati, Georgia, Oh, Junkwang, Jho, Gunu, Hwang, Inchul, Tsiakoulis, Pirros, Chalamandaris, Aimilios

arXiv.org Artificial IntelligenceFeb-2-2024

In this paper, we propose a singing voice synthesis model, Karaoker-SSL, that is trained only on text and speech data as a typical multi-speaker acoustic model. It is a low-resource pipeline that does not utilize any singing data end-to-end, since its vocoder is also trained on speech data. Karaoker-SSL is conditioned by self-supervised speech representations in an unsupervised manner. We preprocess these representations by selecting only a subset of their task-correlated dimensions. The conditioning module is indirectly guided to capture style information during training by multi-tasking. This is achieved with a Conformer-based module, which predicts the pitch from the acoustic model's output. Thus, Karaoker-SSL allows singing voice synthesis without reliance on hand-crafted and domain-specific features. There are also no requirements for text alignments or lyrics timestamps. To refine the voice quality, we employ a U-Net discriminator that is conditioned on the target speaker and follows a Diffusion GAN training scheme.

acoustic model, proc, synthesis, (12 more...)

arXiv.org Artificial Intelligence

2402.0152

Country: Europe > Greece (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech (0.97)

Add feedback

Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching

Ògúnrèmí, Tolúlopé, Manning, Christopher D., Jurafsky, Dan

arXiv.org Artificial IntelligenceNov-25-2023

While many speakers of low-resource languages regularly code-switch between their languages and other regional languages or English, datasets of codeswitched speech are too small to train bespoke acoustic models from scratch or do language model rescoring. Here we propose finetuning self-supervised speech representations such as wav2vec 2.0 XLSR to recognize code-switched data. We find that finetuning self-supervised multilingual representations and augmenting them with n-gram language models trained from transcripts reduces absolute word error rates by up to 20% compared to baselines of hybrid models trained from scratch on code-switched data. Our findings suggest that in circumstances with limited training data finetuning self-supervised representations is a better performing and viable solution.

language pair, recognition, speech recognition, (15 more...)

arXiv.org Artificial Intelligence

2311.15077

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
Asia > Indonesia > Bali (0.04)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

Add feedback

HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

Lee, Sang-Hoon, Choi, Ha-Yeong, Oh, Hyung-Seok, Lee, Seong-Whan

arXiv.org Artificial IntelligenceJul-30-2023

Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at \url{https://hiervst.github.io/}.

linguistic representation, representation, speech, (14 more...)

arXiv.org Artificial Intelligence

2307.16171

Country:

North America > Canada > Quebec > Montreal (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

Wang, Siyang, Henter, Gustav Eje, Gustafson, Joakim, Székely, Éva

arXiv.org Artificial IntelligenceJul-10-2023

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the later of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2303.02719

Country:

Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.30)

Add feedback

Perceive and predict: self-supervised speech representation based loss functions for speech enhancement

Close, George, Ravenscroft, William, Hain, Thomas, Goetze, Stefan

arXiv.org Artificial IntelligenceJun-26-2023

Recent work in the domain of speech enhancement has explored the use of self-supervised speech representations to aid in the training of neural speech enhancement models. However, much of this work focuses on using the deepest or final outputs of self supervised speech representation models, rather than the earlier feature encodings. The use of self supervised representations in such a way is often not fully motivated. In this work it is shown that the distance between the feature encodings of clean and noisy speech correlate strongly with psychoacoustically motivated measures of speech quality and intelligibility, as well as with human Mean Opinion Score (MOS) ratings. Experiments using this distance as a loss function are performed and improved performance over the use of STFT spectrogram distance based loss as well as other common loss functions from speech enhancement literature is demonstrated using objective measures such as perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).

artificial intelligence, machine learning, representation, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP49357.2023.10095666

2301.04388

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Data Science (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback