Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model
Fang, Hung-Chieh, Ye, Nai-Xuan, Shih, Yi-Jen, Peng, Puyuan, Wang, Hsuan-Fu, Berry, Layne, Lee, Hung-yi, Harwath, David
Recent advances in self-supervised speech models have shown significant improvements on many downstream tasks. However, these models have predominantly centered on frame-level training objectives, which can fall short in spoken language understanding (SLU) tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in real-world settings. To address this challenge, we propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process, where the targets are derived from a visually-grounded speech model, notably eliminating the need for paired speech-text data. Our experimental results on four SLU benchmarks suggest the superiority of our model in capturing semantic information.
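As a rough illustration of the idea in the abstract, the sketch below shows one way pseudo word-level targets from a frozen visually-grounded (VG) teacher could supervise a frame-level student encoder: the VG model proposes word segments, its frames are pooled within each segment into target vectors, and the student is trained to match them. The encoder sizes, segment boundaries, mean-pooling, and cosine objective are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of pseudo word-level target supervision (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Stand-in for a HuBERT-style frame-level encoder (hypothetical sizes)."""
    def __init__(self, in_dim=80, hidden=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, feats):                      # feats: (B, T, in_dim)
        return self.encoder(self.proj(feats))      # (B, T, hidden)

def pool_word_targets(vg_frames, segments):
    """Mean-pool frozen VG-teacher frames inside each proposed word segment."""
    return torch.stack([vg_frames[:, s:e].mean(dim=1) for s, e in segments], dim=1)

def pseudo_word_loss(student_frames, word_targets, segments):
    """Match student segment representations to word-level targets via cosine
    distance (an assumed objective, used here only for illustration)."""
    student_words = torch.stack(
        [student_frames[:, s:e].mean(dim=1) for s, e in segments], dim=1)
    return (1 - F.cosine_similarity(student_words, word_targets, dim=-1)).mean()

# Toy usage with random features and fixed pseudo word boundaries.
B, T, D = 2, 100, 80
feats = torch.randn(B, T, D)
segments = [(0, 30), (30, 65), (65, 100)]          # boundaries the VG model would propose
with torch.no_grad():                              # the VG teacher stays frozen
    vg_frames = FrameEncoder()(feats)
student = FrameEncoder()
loss = pseudo_word_loss(student(feats), pool_word_targets(vg_frames, segments), segments)
loss.backward()
print(float(loss))
```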
arXiv.org Artificial Intelligence
Feb-8-2024
- Country:
- North America > United States > Texas (0.14)
- Genre:
- Research Report > New Finding (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language > Text Processing (0.89)
- Speech > Speech Recognition (1.00)