Neural Information Processing Systems 

Similar to the previous analysis of XLSR-53 (Choi et al., 2021), the representations from the 1st layer of XLS-R are already clustered by speaker, while the representations of the later layers are hard to distinguish by speaker. Hence, data augmentation for speech disentanglement is not necessary in our method. Note that we fail to train the model with the representations from the 23rd layer of XLS-R.

Untranscribed text-to-speech. We describe the results of the objective evaluation for speaker adaptation in Table 11. We train Tacotron 2 with a batch size of 256 for 100k steps.

(Table 11 fragment: HierSpeech-U, VCTK+LibriTTS (20) — 3.71, 15.85, 6.40, 4.09, 30.64)
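The layer-wise observation above (early layers cluster by speaker, later layers do not) can be quantified with a simple separability score. The sketch below is illustrative only and not part of the paper's pipeline: it computes a Fisher-style ratio of between-speaker to within-speaker variance on hypothetical layer representations, using synthetic data in place of actual XLS-R hidden states.

```python
import numpy as np

def speaker_separability(reps, speaker_ids):
    """Ratio of between-speaker to within-speaker variance.

    reps: (n_utterances, dim) array of layer representations
    speaker_ids: (n_utterances,) integer speaker labels
    A higher ratio means the layer's representations cluster
    more strongly by speaker identity.
    """
    reps = np.asarray(reps, dtype=float)
    global_mean = reps.mean(axis=0)
    between, within = 0.0, 0.0
    for s in np.unique(speaker_ids):
        group = reps[speaker_ids == s]
        mu = group.mean(axis=0)
        between += len(group) * np.sum((mu - global_mean) ** 2)
        within += np.sum((group - mu) ** 2)
    return between / within

# Synthetic demo: an early-layer-like feature space with distinct
# per-speaker centroids vs. a late-layer-like space with no speaker
# structure (both are stand-ins, not real XLS-R activations).
rng = np.random.default_rng(0)
ids = np.repeat([0, 1, 2], 50)                     # 3 speakers, 50 utterances each
centroids = rng.normal(size=(3, 16)) * 5.0         # well-separated speaker means
early_layer = centroids[ids] + rng.normal(size=(150, 16))
late_layer = rng.normal(size=(150, 16))            # speaker-independent noise

print(speaker_separability(early_layer, ids) > speaker_separability(late_layer, ids))
```

In practice one would replace the synthetic arrays with mean-pooled hidden states extracted from each XLS-R layer per utterance; the same score then traces how speaker information decays across depth.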