CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought

Du, Yexing, Ma, Ziyang, Yang, Yifan, Deng, Keqi, Chen, Xie, Yang, Bo, Xiang, Yang, Liu, Ming, Qin, Bing

Sep-28-2024–arXiv.org Artificial Intelligence

Speech Language Models (SLMs) have demonstrated impressive performance on speech translation tasks. However, existing research focuses primarily on direct instruction fine-tuning and often overlooks the inherent reasoning capabilities of SLMs. In this paper, we introduce a three-stage training framework designed to activate the chain-of-thought (CoT) capabilities of SLMs. We propose CoT-ST, a speech translation model that uses multimodal CoT to decompose speech translation into two sequential steps: speech recognition followed by translation. We validate the effectiveness of our method on two datasets, CoVoST-2 and MuST-C. The experimental results show that CoT-ST outperforms previous state-of-the-art methods, achieving higher BLEU scores (CoVoST-2 en-ja: 30.5 to 30.8; CoVoST-2 en-zh: 45.2 to 47.7; MuST-C en-zh: 19.6 to 21.2). The code is open-sourced at https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/st_covost2 .
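The decomposition described in the abstract, recognition first and translation second, can be illustrated with a minimal sketch. It is not the authors' implementation: the `generate` callable, the prompt wording, the `CoTOutput` type, and the `toy_generate` stand-in are all hypothetical, and the three-stage training framework is not modeled at all. The sketch only shows the inference-time idea that the translation step is conditioned on both the audio and the transcript produced in the first step.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model interface (an assumption): (audio, text_prompt) -> text.
Generate = Callable[[bytes, str], str]


@dataclass
class CoTOutput:
    transcript: str   # intermediate "thought": the recognized source text
    translation: str  # final answer: the target-language text


def cot_translate(audio: bytes, generate: Generate, src: str, tgt: str) -> CoTOutput:
    # Step 1: speech recognition, conditioned on the audio alone.
    asr_prompt = f"Write down what is said in the {src} speech."
    transcript = generate(audio, asr_prompt).strip()

    # Step 2: translation, conditioned on the audio and the step-1 transcript.
    st_prompt = f"Transcript: {transcript}\nTranslate it into {tgt}."
    translation = generate(audio, st_prompt).strip()
    return CoTOutput(transcript, translation)


def toy_generate(audio: bytes, prompt: str) -> str:
    """Stand-in for a real SLM, used only to make the sketch runnable."""
    if "Translate" not in prompt:
        # Pretend the audio bytes already hold the spoken words.
        return audio.decode("utf-8")
    lexicon = {"hello": "你好", "world": "世界"}
    transcript = prompt.splitlines()[0][len("Transcript: "):]
    return "".join(lexicon.get(word, word) for word in transcript.split())


result = cot_translate(b"hello world", toy_generate, "English", "Chinese")
print(result.transcript)
print(result.translation)
```

Returning the transcript alongside the translation keeps the intermediate step inspectable, which makes it easier to tell whether an error came from recognition or from translation.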

Country:
- Europe
  - Belgium (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
- Asia > China
  - Shanghai > Shanghai (0.04)
  - Heilongjiang Province > Harbin (0.04)

Genre:
- Research Report
  - Promising Solution (0.48)
  - New Finding (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (1.00)
  - Natural Language
    - Machine Translation (1.00)
    - Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
