Neural Information Processing Systems 

Similar to the previous analysis of XLSR-53 (Choi et al., 2021), the representations from the 1st layer of XLS-R are already clustered by speaker, while the representations of the later layers are hard to distinguish by speaker. Hence, data augmentation for speech disentanglement is not necessary in our method. Note that we fail to train the model with the representations from the 23rd layer of XLS-R.

Untranscribed text-to-speech. We describe the results of the objective evaluation for speaker adaptation in Table 11. We train Tacotron 2 with a batch size of 256 for 100k steps.

(Table 11 fragment: HierSpeech-U, VCTK+LibriTTS (20) — 3.71, 15.85, 6.40, 4.09, 30.64)
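The layer-wise observation above (early layers cluster by speaker, later layers do not) can be quantified with a simple separability score. The sketch below is illustrative only and not part of the paper's pipeline: it computes a Fisher-style ratio of between-speaker to within-speaker variance on hypothetical layer representations, using synthetic data in place of actual XLS-R hidden states.

```python
import numpy as np

def speaker_separability(reps, speaker_ids):
    """Ratio of between-speaker to within-speaker variance.

    reps: (n_utterances, dim) array of layer representations
    speaker_ids: (n_utterances,) integer speaker labels
    A higher ratio means the layer's representations cluster
    more strongly by speaker identity.
    """
    reps = np.asarray(reps, dtype=float)
    global_mean = reps.mean(axis=0)
    between, within = 0.0, 0.0
    for s in np.unique(speaker_ids):
        group = reps[speaker_ids == s]
        mu = group.mean(axis=0)
        between += len(group) * np.sum((mu - global_mean) ** 2)
        within += np.sum((group - mu) ** 2)
    return between / within

# Synthetic demo: an early-layer-like feature space with distinct
# per-speaker centroids vs. a late-layer-like space with no speaker
# structure (both are stand-ins, not real XLS-R activations).
rng = np.random.default_rng(0)
ids = np.repeat([0, 1, 2], 50)                     # 3 speakers, 50 utterances each
centroids = rng.normal(size=(3, 16)) * 5.0         # well-separated speaker means
early_layer = centroids[ids] + rng.normal(size=(150, 16))
late_layer = rng.normal(size=(150, 16))            # speaker-independent noise

print(speaker_separability(early_layer, ids) > speaker_separability(late_layer, ids))
```

In practice one would replace the synthetic arrays with mean-pooled hidden states extracted from each XLS-R layer per utterance; the same score then traces how speaker information decays across depth.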