DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

Jan-20-2025, 01:05:30 GMT–Neural Information Processing Systems

Due to linguistic and acoustic diversity, the target speech in speech-to-speech translation (S2ST) follows a complex multimodal distribution, which makes it difficult for S2ST models to achieve both high-quality translation and fast decoding. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model that achieves both. To better capture the complex distribution of the target speech, DASpeech adopts a two-pass architecture that decomposes generation into two steps: a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech from the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder and FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG).
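To make the DAG idea concrete, here is a minimal Python sketch of how a single translation could be read off such a graph by following its most probable path, after which the hidden states of the selected vertices would be handed to the acoustic decoder. It is an illustration under stated assumptions, not the authors' implementation: the function `best_path`, the toy `tokens`, `trans`, and `hidden` values, and the choice of most-probable-path search are all hypothetical, and the acoustic decoder itself is not modeled.

```python
import math


def best_path(trans):
    """Return the most probable path from vertex 0 to the last vertex.

    trans[i][j] is an assumed probability of moving from vertex i to
    vertex j. Only j > i is considered, which keeps the graph acyclic.
    """
    n = len(trans)
    score = [-math.inf] * n  # best log-probability of reaching each vertex
    back = [-1] * n          # predecessor of each vertex on its best path
    score[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            p = trans[i][j]
            if p > 0 and score[i] + math.log(p) > score[j]:
                score[j] = score[i] + math.log(p)
                back[j] = i
    if score[n - 1] == -math.inf:
        raise ValueError("last vertex is unreachable from vertex 0")
    # Walk the back-pointers from the last vertex to vertex 0.
    path = [n - 1]
    while path[-1] != 0:
        path.append(back[path[-1]])
    return path[::-1]


# Toy graph (made-up numbers): each vertex carries one token and one hidden state.
tokens = ["<bos>", "hello", "hi", "world", "<eos>"]
trans = [
    [0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.1],
    [0.0, 0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
hidden = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7], [0.8, 0.9]]

path = best_path(trans)
text = [tokens[v] for v in path]            # first pass: target text
acoustic_input = [hidden[v] for v in path]  # states passed to the second pass
print(text)  # ['<bos>', 'hello', 'world', '<eos>']
```

Because every edge points from a lower-numbered vertex to a higher-numbered one, a single left-to-right sweep suffices to score all paths, and no token depends on a previously generated token at decoding time.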

daspeech, decoder, fast and high-quality speech-to-speech translation, (5 more...)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Machine Translation (0.56)
  - Speech > Speech Recognition (0.43)