StreamFlow: Streaming Audio Generation from Discrete Tokens via Streaming Flow Matching
Neural Information Processing Systems
Diffusion models have demonstrated remarkable generative capabilities, and Conditional Flow Matching (CFM) has improved their inference efficiency by following optimal transport paths. However, CFM-based models still require multiple iterative sampling steps, which makes them unsuitable for real-time or streaming generation scenarios. In this paper, we introduce StreamFlow, a novel streaming generative model designed for real-time audio generation from discrete tokens. StreamFlow leverages a causal noising training framework along the time axis and predicts multi-time vector fields at once on each stream, enabling streaming inference with minimal latency. To further improve generalization, we propose Scale-DiT, a Diffusion Transformer architecture that enhances robustness by modeling, normalizing, and scaling feature differences prior to skip connections. This significantly improves the robustness and performance of DiT without increasing the parameter size.
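For readers unfamiliar with CFM, the following is a minimal sketch of the generic formulation the abstract refers to: a straight-line (optimal-transport) path between a noise sample and a data sample, the constant velocity target along that path, and fixed-step Euler sampling, which is where the "multiple iterative sampling steps" come from. It does not implement StreamFlow's causal noising, multi-time prediction, or Scale-DiT. All function names are hypothetical, and `oracle_field` is a toy stand-in for a trained network, assumed here only so the example runs.

```python
import random

def ot_path(x0, x1, t):
    """Point on the straight-line (optimal-transport) path from noise x0 to data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def ot_target_velocity(x0, x1):
    """Regression target for the vector field along that path: the constant x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    Each step costs one evaluation of the vector field (one network call in
    a real model), so latency grows with num_steps.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy setup (assumption): a single 3-dimensional data point and Gaussian noise.
random.seed(0)
x1 = [0.5, -1.0, 2.0]
x0 = [random.gauss(0.0, 1.0) for _ in x1]

def oracle_field(x, t):
    # Exact conditional field toward the single target x1: (x1 - x) / (1 - t).
    # A trained model would approximate this with a neural network.
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]

sample = euler_sample(oracle_field, x0, num_steps=8)
```

With the exact field, the Euler trajectory stays on the straight line and `sample` lands on `x1`. A learned field is only approximate, which is why practical CFM samplers use several steps.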
Jun-14-2026, 11:42:42 GMT
- Country:
  - Asia (0.28)
- Genre:
  - Research Report > Experimental Study (1.00)
- Industry:
  - Media (0.46)
- Technology:
  - Information Technology > Artificial Intelligence
    - Vision (1.00)
    - Speech (1.00)
    - Natural Language > Large Language Model (0.88)
    - Machine Learning > Neural Networks
      - Deep Learning (0.48)