Self-supervised Rewiring of Pre-trained Speech Encoders: Towards Faster Fine-tuning with Less Labels in Speech Processing
Hao Yang, Jinming Zhao, Gholamreza Haffari, Ehsan Shareghi
arXiv.org Artificial Intelligence
Pre-trained speech Transformers have facilitated great success across various speech processing tasks. However, fine-tuning these encoders for downstream tasks requires sufficiently large training data to converge or to achieve state-of-the-art performance. In the text domain this has been partly attributed to the sub-optimality of the representation space in pre-trained Transformers. In this work, we take a sober look into pre-trained speech encoders and rewire their representation space without requiring any task-specific labels. Our method utilises a neutrally synthesised version of audio inputs along with frame masking to construct positive pairs for contrastive self-supervised learning. When used to augment the wav2vec 2.0 encoder, we observe consistent improvement of isotropy in the representation space. Our experiments on 6 speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvement, especially in low-resource settings.
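The abstract describes constructing positive pairs from a neutrally synthesised copy of each utterance and a frame-masked view, then training contrastively. Below is a minimal sketch (not the authors' released code) of how such a setup could look with an InfoNCE-style in-batch loss on pooled wav2vec 2.0 embeddings; the model checkpoint, pooling, masking rate, and temperature are illustrative assumptions, and the paper's exact objective and hyperparameters may differ.

```python
# Sketch of contrastive "rewiring" with positive pairs from a neutrally
# re-synthesised utterance and a frame-masked view of the original audio.
# All hyperparameters and helper names here are assumptions for illustration.
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

def frame_mask(wave: torch.Tensor, mask_prob: float = 0.1, span: int = 400) -> torch.Tensor:
    """Zero out random spans of raw-audio samples (assumed masking scheme)."""
    wave = wave.clone()
    num_spans = int(mask_prob * wave.size(-1) / span)
    for _ in range(num_spans):
        start = torch.randint(0, wave.size(-1) - span, (1,)).item()
        wave[..., start:start + span] = 0.0
    return wave

def embed(wave: torch.Tensor) -> torch.Tensor:
    """Mean-pool the encoder's last hidden states into one utterance vector."""
    hidden = encoder(wave).last_hidden_state        # (B, T', D)
    return F.normalize(hidden.mean(dim=1), dim=-1)  # (B, D)

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """In-batch contrastive loss: (z1[i], z2[i]) are positives, the rest negatives."""
    logits = z1 @ z2.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

# `original` is a batch of raw waveforms; `synthesised` stands in for the
# neutrally re-synthesised versions of the same utterances (e.g. produced
# offline by a TTS system), here mocked with random tensors.
original = torch.randn(4, 16000)
synthesised = torch.randn(4, 16000)

loss = info_nce(embed(frame_mask(original)), embed(synthesised))
loss.backward()
```

In this sketch the two views of the same utterance attract each other while other utterances in the batch act as negatives, which is one standard way to spread (and thus "rewire") the representation space without any task-specific labels.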
Oct-24-2022
- Genre:
- Research Report (0.64)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning
- Inductive Learning (0.90)
- Neural Networks > Deep Learning (0.47)
- Natural Language (1.00)
- Speech > Speech Recognition (0.68)