StreamFlow: Streaming Audio Generation from Discrete Tokens via Streaming Flow Matching

Jun-14-2026, 11:42:42 GMT–Neural Information Processing Systems

Diffusion models have demonstrated remarkable generative capabilities, and Conditional Flow Matching (CFM) has improved their inference efficiency by following optimal transport paths. However, CFM-based models still require multiple iterative sampling steps, which makes them unsuitable for real-time or streaming generation scenarios. In this paper, we introduce StreamFlow, a novel streaming generative model designed for real-time audio generation from discrete tokens. StreamFlow leverages a causal noising training framework along the time axis and predicts multi-time vector fields at once on each stream, enabling streaming inference with minimal latency. To further improve generalization, we propose Scale-DiT, a Diffusion Transformer architecture that enhances robustness by modeling, normalizing, and scaling feature differences prior to skip connections. This significantly improves the robustness and performance of DiT without increasing the parameter size.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

Neural Information Processing Systems

Jun-14-2026, 11:42:42 GMT

Conferences PDF

Add feedback

Country:
- Asia (0.28)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Media (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Speech (1.00)
  - Natural Language > Large Language Model (0.88)
  - Machine Learning > Neural Networks
    - Deep Learning (0.48)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found