A Appendix
–Neural Information Processing Systems
A.1 Self-supervised loss formula
Wav2vec 2.0, when trained in a self-supervised way, uses a loss (L) which is the weighted combination of two losses: one diversity loss (L [...]

Then, we use the compute_regressor function of nistats [Abraham et al., 2014] with the 'glover' model to temporally convolve (h R [...]

To address this issue, [Pasad et al., 2021] explored the encoding of local acoustic features, phone identity, word identity and word meaning across layers. Similarly, [Millet et al., 2021] compared representations to human behavioural data to assess whether they better captured listeners' perception of higher-level phonemic properties or of lower-level subphonemic properties of speech stimuli. Finally, a recent study by [Vaidya et al., 2022] explores filter banks, spectrograms, phonemes and words across layers. Here, we complement these analyses by showing that self-supervised learning allows wav2vec 2.0 to learn, along its hierarchy, representations of MEL spectrograms, phonetic categories and word embeddings (Figure S1).

We study the following features: the MEL spectrogram of the audio, computed using librosa (d=128); the phonemes (categorical features).
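
The two features named above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the paper: the function names mel_features and one_hot_phonemes are hypothetical, the log compression of the MEL spectrogram and the use of the file's native sampling rate are assumptions, and the phoneme inventory is simply inferred from the input when none is given. Only n_mels=128 (d=128) and the use of librosa come from the text.

```python
import numpy as np


def mel_features(wav_path, n_mels=128):
    """Return a (n_frames, n_mels) log-MEL spectrogram for one audio file."""
    # Imported lazily so the phoneme helper below can be used without librosa.
    import librosa

    # sr=None keeps the file's native sampling rate (an assumption of this sketch).
    y, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    # Log compression is an assumption; the text only states d=128.
    return librosa.power_to_db(mel, ref=np.max).T


def one_hot_phonemes(phonemes, inventory=None):
    """Encode a phoneme sequence as a (n_phonemes, n_categories) 0/1 matrix."""
    if inventory is None:
        # Hypothetical default: use the sorted set of observed phonemes.
        inventory = sorted(set(phonemes))
    index = {p: i for i, p in enumerate(inventory)}
    rows = np.arange(len(phonemes))
    cols = np.array([index[p] for p in phonemes], dtype=int)
    features = np.zeros((len(phonemes), len(inventory)), dtype=np.float32)
    features[rows, cols] = 1.0  # exactly one active category per phoneme
    return features, inventory
```

For example, one_hot_phonemes(["k", "ae", "t"]) yields a 3 x 3 matrix with one non-zero entry per row, while mel_features applied to an audio file yields one 128-dimensional vector per spectrogram frame.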
Aug-19-2025, 07:48:25 GMT