Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation
Yuchen Han, Chen Xu, Tong Xiao, Jingbo Zhu
arXiv.org Artificial Intelligence
Pre-training and fine-tuning is a paradigm for alleviating the data scarcity problem in end-to-end speech translation (E2E ST). The commonplace "modality gap" between speech and text data often leads to inconsistent inputs between pre-training and fine-tuning. However, we observe that this gap occurs in the early stages of fine-tuning but does not have a major impact on the final performance. On the other hand, we find that there is another gap, which we call the "capacity gap": high-resource tasks (such as ASR and MT) always require a large model to fit, and when that model is reused for a low-resource task (E2E ST), it achieves sub-optimal performance due to over-fitting. In a case study, we find that regularization plays a more important role than a well-designed modality adaption method, achieving 29.0 for en-de and 40.3 for en-fr on the MuST-C dataset. Code and models are available at https://github.com/hannlp/TAB.
Jun-13-2023
- Country:
- South America > Chile
- Oceania > Australia
- Queensland > Brisbane (0.04)
- North America > United States
- Washington > King County
- Seattle (0.04)
- Pennsylvania > Allegheny County
- Pittsburgh (0.04)
- New York > New York County
- New York City (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.04)
- Europe
- Sweden > Stockholm
- Stockholm (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Czechia > South Moravian Region
- Brno (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- Middle East > UAE
- Abu Dhabi Emirate > Abu Dhabi (0.04)
- China
- Liaoning Province > Shenyang (0.04)
- Shanghai > Shanghai (0.04)
- Genre:
- Research Report (0.83)
- Technology: