Self-supervised Predictive Coding Models Encode Speaker and Phonetic Information in Orthogonal Subspaces

Oli Liu, Hao Tang, Sharon Goldwater

arXiv.org Artificial Intelligence 

Self-supervised speech representations are known to encode both speaker and phonetic information, but how they are distributed in the high-dimensional space remains largely unexplored. We hypothesize that they are encoded in orthogonal subspaces, a property that lends itself to simple disentanglement. Applying principal component analysis to representations of two predictive coding models, we identify two subspaces that capture speaker and phonetic variances, and confirm that they are nearly orthogonal.

In this work, we explicitly investigate how speaker and phonetic information are distributed in the representation space learned by SSL models. We hypothesize that a good representation (one that is efficient and works well for predicting speech) should implicitly disentangle these two sources of information, since they vary independently in the processes that generate the speech signal. If so, then the two types of information would be encoded in orthogonal subspaces.
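The idea of two nearly orthogonal variance subspaces can be illustrated with a small numerical sketch. The following is not the paper's procedure and uses purely synthetic data: it builds two sets of vectors whose variance concentrates along different axes, extracts the top principal directions of each via SVD (the core of PCA), and measures orthogonality between the resulting subspaces through the singular values of the basis product, which are the cosines of the principal angles.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical representation dimension

# Synthetic stand-ins: "speaker" variation mostly along axes 0-1,
# "phonetic" variation mostly along axes 2-4, plus small isotropic noise.
speaker = rng.normal(size=(500, d)) * np.r_[5.0 * np.ones(2), 0.1 * np.ones(d - 2)]
phonetic = rng.normal(size=(500, d)) * np.r_[0.1 * np.ones(2), 5.0 * np.ones(3), 0.1 * np.ones(d - 5)]

def top_pcs(X, k):
    """Return a (d, k) orthonormal basis of the top-k principal directions."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                              # right singular vectors = PCs

S = top_pcs(speaker, 2)    # 2-dim "speaker" subspace
P = top_pcs(phonetic, 3)   # 3-dim "phonetic" subspace

# Singular values of S^T P are cosines of the principal angles between
# the subspaces; values near 0 mean the subspaces are nearly orthogonal.
cosines = np.linalg.svd(S.T @ P, compute_uv=False)
print("max cosine of principal angles:", cosines.max())
```

Because the two synthetic sources vary along disjoint axes, the maximum cosine comes out close to zero, i.e. every principal angle is close to 90 degrees. The same principal-angle measurement applies to any pair of PCA-derived subspaces, whatever representations they come from.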