A2SB: Audio-to-Audio Schrodinger Bridges

Kong, Zhifeng, Shih, Kevin J, Nie, Weili, Vahdat, Arash, Lee, Sang-gil, Santos, Joao Felipe, Jukic, Ante, Valle, Rafael, Catanzaro, Bryan

arXiv.org Artificial Intelligence 

Audio in the real world may be perturbed by numerous factors, degrading audio quality. This work presents an audio restoration model tailored for high-resolution music at 44.1 kHz. Our model, the Audio-to-Audio Schrodinger Bridge (A2SB), is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). A2SB is end-to-end, requiring no vocoder to predict waveform outputs, able to restore hour-long audio inputs, and trained on permissively licensed music data. A2SB achieves state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets. Our demo website is https://research.nvidia.com/labs/adlr/A2SB/

Audio in the real world may be perturbed by numerous factors such as recording devices, data compression, and online transfer. For instance, certain recording devices and compression methods may result in a low sampling rate, and online transfer may cause a short audio segment to be lost. These restoration problems are usually ill-posed (Narayanaswamy et al., 2021; Moliner et al., 2023) and are typically solved with data-driven generative models. However, many of these methods are task-specific, designed for the speech domain, or trained to restore only the degraded magnitude spectrogram, which requires an additional vocoder to transform the restored magnitude into a waveform.
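To make the two restoration tasks concrete, the sketch below simulates the kinds of degradation described above: low-pass filtering a 44.1 kHz waveform to mimic a low-sampling-rate recording (the input to bandwidth extension) and zeroing out a short segment to mimic audio lost in transfer (the input to inpainting). This is only an illustrative example, not the paper's data pipeline; the cutoff frequency and segment boundaries are placeholder values chosen for demonstration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

SR = 44_100  # sampling rate targeted by the paper (44.1 kHz)


def simulate_bandwidth_loss(wav: np.ndarray, cutoff_hz: float = 4_000.0) -> np.ndarray:
    """Low-pass filter a waveform to mimic a low-sampling-rate recording.

    Bandwidth extension then amounts to predicting the removed
    high-frequency content above `cutoff_hz` (illustrative value).
    """
    sos = butter(8, cutoff_hz, btype="low", fs=SR, output="sos")
    return sosfiltfilt(sos, wav)


def simulate_missing_segment(wav: np.ndarray, start_s: float, dur_s: float) -> np.ndarray:
    """Zero out a short segment, e.g. audio lost during online transfer.

    Inpainting then amounts to re-generating the masked samples.
    """
    degraded = wav.copy()
    start = int(start_s * SR)
    end = start + int(dur_s * SR)
    degraded[start:end] = 0.0
    return degraded


if __name__ == "__main__":
    # Toy 2-second 440 Hz tone standing in for a music recording.
    t = np.linspace(0.0, 2.0, 2 * SR, endpoint=False)
    wav = 0.5 * np.sin(2 * np.pi * 440.0 * t)

    low_band = simulate_bandwidth_loss(wav)            # input for bandwidth extension
    gapped = simulate_missing_segment(wav, 0.8, 0.25)  # input for inpainting
```

In both cases the restoration problem is ill-posed: many plausible high-frequency components or missing segments are consistent with the same degraded input, which is why data-driven generative models are used.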