CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought
Du, Yexing, Ma, Ziyang, Yang, Yifan, Deng, Keqi, Chen, Xie, Yang, Bo, Xiang, Yang, Liu, Ming, Qin, Bing
arXiv.org Artificial Intelligence
Speech Language Models (SLMs) have demonstrated impressive performance on speech translation tasks. However, existing research primarily focuses on direct instruction fine-tuning and often overlooks the inherent reasoning capabilities of SLMs. In this paper, we introduce a three-stage training framework designed to activate the chain-of-thought (CoT) capabilities of SLMs. We propose CoT-ST, a speech translation model that uses multimodal CoT to decompose speech translation into two sequential steps: speech recognition, then translation. We validate the effectiveness of our method on two datasets, CoVoST-2 and MuST-C. The experimental results show that CoT-ST outperforms previous state-of-the-art methods, achieving higher BLEU scores (CoVoST-2 en-ja: 30.5 to 30.8, en-zh: 45.2 to 47.7; MuST-C en-zh: 19.6 to 21.2). This work is open-sourced at https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/st_covost2.
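The decomposition described in the abstract (recognize first, then translate) can be illustrated with a minimal sketch. This is not the authors' implementation: the separator token `<|translate|>`, the prompt wording, and the names `build_cot_prompt`, `parse_cot_output`, and `CoTResult` are all hypothetical, chosen only to show how a single generated string might carry both the intermediate transcript and the final translation.

```python
from dataclasses import dataclass

# Hypothetical separator between the two CoT steps; the paper's actual
# output format is not given in the abstract.
SEPARATOR = "<|translate|>"


@dataclass
class CoTResult:
    transcript: str   # step 1: speech recognition output
    translation: str  # step 2: translation of that transcript


def build_cot_prompt(src_lang: str, tgt_lang: str) -> str:
    """Build an illustrative instruction asking for both steps in order."""
    return (
        f"First transcribe the {src_lang} speech, then translate the "
        f"transcript into {tgt_lang}. Answer as: "
        f"<transcript>{SEPARATOR}<translation>"
    )


def parse_cot_output(text: str) -> CoTResult:
    """Split a generated string into its transcript and translation parts."""
    if SEPARATOR not in text:
        raise ValueError("output does not contain the CoT separator")
    # Split only on the first separator so the translation may contain it.
    transcript, translation = text.split(SEPARATOR, 1)
    return CoTResult(transcript.strip(), translation.strip())
```

Under these assumptions, `parse_cot_output("hello world <|translate|> 你好，世界")` yields a `CoTResult` whose `transcript` is the recognized English text and whose `translation` is the Chinese output, so the intermediate recognition step stays inspectable alongside the final answer.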
Sep-28-2024
- Country:
  - Europe
    - Belgium (0.04)
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.04)
  - Asia > China
    - Shanghai > Shanghai (0.04)
    - Heilongjiang Province > Harbin (0.04)
- Genre:
  - Research Report
    - Promising Solution (0.48)
    - New Finding (0.48)
- Technology: