Optimizing Speech Multi-View Feature Fusion through Conditional Computation

Shan, Weiqiao, Zhang, Yuhao, Han, Yuchen, Li, Bei, Zhao, Xiaofeng, Li, Yuang, Zhang, Min, Yang, Hao, Xiao, Tong, Zhu, Jingbo

Jan-14-2025–arXiv.org Artificial Intelligence

Recent advances have highlighted the efficacy of self-supervised learning (SSL) features in various speech-related tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that while SSL features expedite model convergence, they conflict with traditional spectral features such as FBanks in terms of update directions. In response, we propose a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy. This framework mitigates feature conflicts and bolsters model robustness to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence while maintaining performance on par with spectral models across multiple speech translation tasks on the MuST-C dataset.
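As a rough illustration of the two ideas the abstract names, the sketch below shows one common way to detect conflicting update directions (a negative cosine similarity between two gradient vectors) and one simple form of gated fusion (an element-wise sigmoid gate mixing two feature views). It is not the paper's gating network or dropout strategy: the function names `cosine` and `gated_fusion`, the toy vectors, and the choice of a sigmoid gate are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two flat gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sigmoid(x):
    """Logistic function mapping a gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(ssl_feat, spec_feat, gate_logits):
    """Element-wise mix g * ssl + (1 - g) * spectral, with g = sigmoid(logit).

    Hypothetical fusion rule for illustration; the paper's gate may differ.
    """
    fused = []
    for s, f, z in zip(ssl_feat, spec_feat, gate_logits):
        g = sigmoid(z)
        fused.append(g * s + (1.0 - g) * f)
    return fused


# Toy gradients (assumed values): opposite directions count as a conflict.
grad_ssl = [1.0, 0.0]
grad_spec = [-1.0, 0.0]
in_conflict = cosine(grad_ssl, grad_spec) < 0

# Zero logits give a gate of 0.5, so the output is the average of both views.
fused = gated_fusion([2.0, 4.0], [0.0, 0.0], [0.0, 0.0])
```

In a setup like this, a conflict signal such as `in_conflict` could be used to push the gate toward one view, which is the general intuition behind making a gate sensitive to gradients.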

artificial intelligence, machine learning, natural language, (15 more...)

Country:
- Asia > China > Liaoning Province (0.14)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Representation & Reasoning (0.93)
  - Speech > Speech Recognition (0.68)
  - Natural Language > Machine Translation (0.67)
