Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

Zeghidour, Neil, Kharitonov, Eugene, Orsini, Manu, Volhejn, Václav, de Marmiesse, Gabriel, Grave, Edouard, Pérez, Patrick, Mazaré, Laurent, Défossez, Alexandre

Sep-30-2025–arXiv.org Artificial Intelligence

We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence models rely on learning a policy for choosing when to advance on the input stream or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments on these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, and is even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
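
The role of the delay can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes two streams already aligned to a common frame rate and represented as integer token ids, and the names `PAD`, `delay_stream`, `asr_frames` and `tts_frames` are hypothetical. It only shows how shifting one stream relative to the other decides which stream acts as the input and which as the output.

```python
from typing import List, Sequence

PAD = -1  # hypothetical padding token id, chosen for this illustration only


def delay_stream(stream: Sequence[int], delay: int, pad: int = PAD) -> List[int]:
    """Shift a stream right by `delay` frames, padding the start.

    The result keeps the original length, so the last `delay` frames
    of the input are dropped in this simplified version.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    return ([pad] * delay + list(stream))[: len(stream)]


# Two toy streams, assumed to be time-aligned frame by frame.
audio = [10, 11, 12, 13, 14]
text = [1, 2, 3, 4, 5]

# ASR-like layout: the text stream is delayed, so at each step the
# text token sits next to audio that is two frames further along.
asr_frames = list(zip(audio, delay_stream(text, 2)))

# TTS-like layout: the audio stream is delayed instead.
tts_frames = list(zip(delay_stream(audio, 2), text))

print(asr_frames)
print(tts_frames)
```

In the ASR-like layout, the text token for a given frame is paired with audio two frames later, so a model predicting it has already seen that much audio lookahead. Swapping which stream is delayed gives the TTS-like layout.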

artificial intelligence, machine learning, natural language

Country:
- Europe (1.00)
- North America > United States (0.68)
- Oceania > Australia (0.46)

Genre:
- Research Report (0.82)

Industry:
- Health & Medicine (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Speech > Speech Recognition (1.00)
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)