Presto! Distilling Steps and Layers for Accelerating Music Generation
Novack, Zachary, Zhu, Ge, Casebeer, Jonah, McAuley, Julian, Berg-Kirkpatrick, Taylor, Bryan, Nicholas J.
Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers that reduces both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning by better preserving hidden-state variance. Finally, we combine our step and layer distillation methods for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show that each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32-second mono/stereo 44.1kHz audio, 15x faster than comparable SOTA), the fastest high-quality TTM method to our knowledge.

We have seen a renaissance of audio-domain generative media (Chen et al., 2024; Agostinelli et al., 2023; Liu et al., 2023; Copet et al., 2023), with increasing capabilities for both text-to-audio (TTA) and text-to-music (TTM) generation. This work has been driven in part by audio-domain diffusion models (Song et al., 2020; Ho et al., 2020; Song et al., 2021), which enable considerably better audio modeling than generative adversarial network (GAN) or variational autoencoder (VAE) methods (Dhariwal & Nichol, 2021). Diffusion models, however, suffer from long inference times due to their iterative denoising process, requiring a substantial number of function evaluations (NFE) during inference.
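The core of the step-distillation idea is the distribution-matching (DMD) gradient: re-noise the few-step student's output and push it in the direction given by the difference between a "fake" score model (tracking the student's distribution) and the frozen teacher's real-data score. The following is a minimal PyTorch sketch of one such update under stated assumptions; `student`, `score_real`, `score_fake`, and the EDM-style noise handling are all placeholders, not the paper's actual interfaces.

```python
import torch

# Minimal sketch of a DMD-style step-distillation update (all names here are
# hypothetical placeholders, not the Presto! codebase).

def dmd_student_loss(student, score_real, score_fake, z, text_emb, sigma):
    """One distribution-matching update for a few-step student generator."""
    x = student(z, text_emb)               # student maps noise straight to audio latents
    x_t = x + sigma * torch.randn_like(x)  # re-noise the output at noise level sigma
    with torch.no_grad():
        # DMD gradient direction: fake (student-distribution) score minus
        # the frozen teacher's real-data score at the noised sample.
        grad = score_fake(x_t, sigma, text_emb) - score_real(x_t, sigma, text_emb)
    # Surrogate objective whose gradient w.r.t. x equals `grad`.
    return (x * grad).sum() / x.shape[0]
```

In practice this update is alternated with retraining the fake score model on fresh student samples via the standard denoising objective, and, per the abstract, combined with an adversarial (GAN-based) term; the sketch omits both.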
DITTO: Diffusion Inference-Time T-Optimization for Music Generation
Novack, Zachary, McAuley, Julian, Berg-Kirkpatrick, Taylor, Bryan, Nicholas J.
We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework for controlling pre-trained text-to-music diffusion models at inference time by optimizing initial noise latents. Our method can optimize through any differentiable feature-matching loss to achieve a target (stylized) output and leverages gradient checkpointing for memory efficiency. We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control, all without ever fine-tuning the underlying model. When we compare our approach against related training-, guidance-, and optimization-based methods, we find DITTO achieves state-of-the-art performance on nearly all tasks, outperforming comparable approaches on controllability, audio quality, and computational efficiency, thus opening the door to high-quality, flexible, training-free control of diffusion models. Sound examples can be found at https://DITTO-Music.github.io/web/.
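As the abstract describes, the only free variable is the initial noise latent, and gradient checkpointing makes backpropagation through the full sampling loop memory-feasible. The sketch below illustrates that loop in PyTorch under stated assumptions: `denoise_step`, `feature_fn`, and `target_feats` are hypothetical stand-ins for a pre-trained sampler step, a differentiable feature extractor, and the target (stylized) features, none of which are the authors' actual API.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Minimal sketch of DITTO-style noise-latent optimization (placeholder names).

def ditto(denoise_step, feature_fn, target_feats, text_emb, shape,
          num_steps=40, opt_iters=50, lr=1e-2):
    x_T = torch.randn(shape, requires_grad=True)  # initial noise latent: the only variable
    opt = torch.optim.Adam([x_T], lr=lr)
    for _ in range(opt_iters):
        x = x_T
        for t in reversed(range(num_steps)):
            # Checkpointing recomputes activations during the backward pass,
            # keeping memory flat across the full sampling loop.
            x = checkpoint(denoise_step, x, torch.tensor(t), text_emb,
                           use_reentrant=False)
        loss = torch.nn.functional.mse_loss(feature_fn(x), target_feats)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x_T.detach()
```

Running the sampler once more from the optimized latent yields the final audio; checkpointing trades recomputation for the memory that storing activations for all `num_steps` denoising calls would otherwise require.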