Evaluating the Representation of Vowels in Wav2Vec Feature Extractor: A Layer-Wise Analysis Using MFCCs
De Cristofaro, Domenico; Vitale, Vincenzo Norman; Vietti, Alessandro
arXiv.org Artificial Intelligence
Automatic Speech Recognition has advanced with self-supervised learning, which enables feature extraction directly from raw audio. In Wav2Vec, a CNN first transforms the audio into feature vectors, which the transformer then processes. This study examines the information the CNN extracts for monophthong vowels, using the TIMIT corpus. We compare MFCCs, MFCCs with formants, and CNN activations by training SVM classifiers for front-back vowel identification, and use classification accuracy as a measure of how well each feature set encodes phonetic information.
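To make the layer-wise view of the CNN concrete, the sketch below computes how many feature frames each convolutional layer produces and how much audio each frame covers. It is illustrative only and not taken from the paper: the kernel and stride values are assumed to be those of the published wav2vec 2.0 base feature encoder (seven unpadded 1-D convolutions), and the helper names `frames_per_layer` and `receptive_field_and_hop` are hypothetical.

```python
# Assumption: kernel/stride pairs of the wav2vec 2.0 base feature encoder.
# The abstract does not state the exact configuration used in the study.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
SAMPLE_RATE = 16_000  # TIMIT audio is sampled at 16 kHz


def frames_per_layer(num_samples, layers=CONV_LAYERS):
    """Return the number of output frames after each conv layer (no padding)."""
    lengths = []
    n = num_samples
    for kernel, stride in layers:
        # Standard output-length formula for an unpadded 1-D convolution.
        n = (n - kernel) // stride + 1
        lengths.append(n)
    return lengths


def receptive_field_and_hop(layers=CONV_LAYERS):
    """Return (receptive field, hop) in samples after each conv layer."""
    rf, hop = 1, 1
    per_layer = []
    for kernel, stride in layers:
        rf += (kernel - 1) * hop  # each layer widens the window by (k - 1) hops
        hop *= stride             # strides multiply through the stack
        per_layer.append((rf, hop))
    return per_layer


if __name__ == "__main__":
    lengths = frames_per_layer(SAMPLE_RATE)  # one second of audio
    for i, (n, (rf, hop)) in enumerate(zip(lengths, receptive_field_and_hop()), 1):
        rf_ms = 1000 * rf / SAMPLE_RATE
        hop_ms = 1000 * hop / SAMPLE_RATE
        print(f"layer {i}: {n} frames, window {rf_ms:.2f} ms, hop {hop_ms:.2f} ms")
```

Under these assumptions the final layer emits frames covering 25 ms of audio with a 20 ms hop, a time scale comparable to typical MFCC analysis windows, which is part of what makes a comparison between CNN activations and MFCCs natural.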
Aug-26-2025
- Country:
- Europe > Italy > Trentino-Alto Adige/Südtirol > South Tyrol (0.04)
- Genre:
- Research Report > New Finding (0.94)
- Technology: