Fast Text-to-Audio Generation with Adversarial Post-Training
Zachary Novack, Zach Evans, Zack Zukowski, Josiah Taylor, CJ Carr, Julian Parker, Adnan Al-Sinan, Gian Marco Iodice, Julian McAuley, Taylor Berg-Kirkpatrick, Jordi Pons
arXiv.org Artificial Intelligence
Text-to-audio systems, while increasingly performant, are slow at inference time, making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to compare against their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating $\approx$12s of 44.1kHz stereo audio in $\approx$75ms on an H100, and in $\approx$7s on a mobile edge device, making it the fastest text-to-audio model to our knowledge.
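To make the two ingredients of ARC concrete, here is a minimal NumPy sketch of (1) a relativistic adversarial loss, where the discriminator scores real audio relative to generated audio rather than in isolation, and (2) an in-batch contrastive discriminator objective over audio/prompt embedding pairs. This is an illustrative assumption of the general loss shapes (InfoNCE-style contrastive term, softplus relativistic term), not the paper's exact formulation; all function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def relativistic_d_loss(d_real, d_fake):
    # Relativistic discriminator loss: push real scores above fake scores,
    # paired per-sample (sketch of the general relativistic GAN form).
    return softplus(-(d_real - d_fake)).mean()

def relativistic_g_loss(d_real, d_fake):
    # Generator side: push fake scores above the paired real scores.
    return softplus(d_real - d_fake).mean()

def contrastive_d_loss(audio_emb, text_emb, temperature=0.07):
    # In-batch contrastive objective: each audio embedding should score
    # highest against its own prompt embedding, with the rest of the batch
    # acting as negatives (InfoNCE-style; the paper's exact objective
    # may differ).
    a = audio_emb / np.linalg.norm(audio_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = a @ t.T / temperature              # cosine similarities
    log_z = np.logaddexp.reduce(logits, axis=1) # per-row log partition
    return np.mean(log_z - np.diag(logits))     # cross-entropy on the diagonal
```

In a full post-training loop, the discriminator would be trained on the sum of the relativistic and contrastive terms, so that it penalizes both unrealistic audio and audio that ignores the prompt.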
May 21, 2025