SampleRNN
Full-band General Audio Synthesis with Score-based Diffusion
Pascual, Santiago, Bhattacharya, Gautam, Yeh, Chunghsin, Pons, Jordi, Serrà, Joan
Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of an autoregressive approach, are typically composed of pre-trained latent encoders and/or several cascaded modules. In this work, we propose a diffusion-based generative model for general audio synthesis, named DAG, which deals with full-band signals end-to-end in the waveform domain. Results show the superiority of DAG over existing label-conditioned generators in terms of both quality and diversity. More specifically, when compared to the state of the art, the band-limited and full-band versions of DAG achieve relative improvements of up to 40% and 65%, respectively. We believe DAG is flexible enough to accommodate different conditioning schemas while providing good-quality synthesis.
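The abstract does not spell out DAG's architecture, but the sampling loop common to denoising-diffusion models is compact enough to sketch. Below is a minimal, illustrative PyTorch sketch of label-conditioned diffusion sampling in the waveform domain; the ScoreNet network, the linear beta schedule, and every size in it are assumptions made for illustration, not the paper's actual model or sampler.

import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    # Toy noise-prediction network conditioned on a class label and a
    # diffusion timestep; a stand-in, not DAG's actual architecture.
    def __init__(self, n_classes=10, n_steps=1000, hidden=64):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, hidden)
        self.time_emb = nn.Embedding(n_steps, hidden)
        self.inp = nn.Conv1d(1, hidden, kernel_size=5, padding=2)
        self.cond = nn.Linear(2 * hidden, hidden)
        self.out = nn.Conv1d(hidden, 1, kernel_size=5, padding=2)

    def forward(self, x, t, y):
        # x: (B, 1, T) noisy waveform, t: (B,) timesteps, y: (B,) labels
        h = self.inp(x)
        c = self.cond(torch.cat([self.time_emb(t), self.label_emb(y)], dim=-1))
        h = torch.relu(h + c.unsqueeze(-1))  # broadcast conditioning over time
        return self.out(h)                   # predicted noise, same shape as x

@torch.no_grad()
def sample(model, label, length=16000, n_steps=1000):
    # Standard DDPM ancestral sampling with a linear beta schedule
    # (an assumption; the paper's exact sampler and schedule may differ).
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, 1, length)             # start from pure Gaussian noise
    y = torch.tensor([label])
    for t in reversed(range(n_steps)):
        eps = model(x, torch.tensor([t]), y)
        # posterior mean: strip the predicted noise component, then rescale
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-noise
    return x.squeeze()

waveform = sample(ScoreNet(), label=3, length=8000, n_steps=50)  # few steps to keep the demo fast

A trained model of this shape would be fit by adding noise to clean waveforms at random timesteps and regressing the network output onto that noise; the loop above only illustrates the generation side.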
Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning
Liu, Xubo, Iqbal, Turab, Zhao, Jinzheng, Huang, Qiushi, Plumbley, Mark D., Wang, Wenwu
Deep generative models have recently achieved impressive performance in speech and music synthesis. However, compared to the generation of these domain-specific sounds, generating general sounds (such as sirens or gunshots) has received less attention, despite their wide applications. In previous work, the SampleRNN method was considered for sound generation in the time domain. However, SampleRNN is potentially limited in capturing long-range dependencies within sounds, as it only back-propagates through a limited number of samples. In this work, we propose a method for generating sounds via neural discrete time-frequency representation learning, conditioned on sound classes. This offers the advantage of efficiently modelling long-range dependencies while retaining local fine-grained structure within sound clips. We evaluate our approach on the UrbanSound8K dataset against SampleRNN, using performance metrics that measure the quality and diversity of the generated sounds. Experimental results show that our method offers comparable quality and significantly better diversity.
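The phrase "neural discrete time-frequency representation learning" suggests a VQ-VAE-style pipeline: encode spectrogram frames to continuous vectors, quantize them against a learned codebook, and model the resulting token sequence with a class-conditioned prior. Below is a minimal PyTorch sketch of the quantization step only; the codebook size, embedding width, and the stand-in encoder are assumptions for illustration, not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    # Nearest-neighbour codebook lookup with the standard VQ-VAE losses.
    def __init__(self, n_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.beta = beta

    def forward(self, z):
        # z: (B, T, D) continuous encoder outputs, one vector per frame
        flat = z.reshape(-1, z.shape[-1])
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2.0 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))   # squared L2 distances
        idx = dist.argmin(dim=1).view(z.shape[:-1])     # (B, T) token ids
        z_q = self.codebook(idx)                        # quantized vectors
        # codebook loss pulls codes toward encodings; commitment loss the reverse
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()  # straight-through gradient estimator
        return z_q, idx, loss

encoder = nn.Linear(128, 64)              # stand-in spectrogram-frame encoder
z = encoder(torch.randn(2, 100, 128))     # 2 clips, 100 frames of 128 mel bins
z_q, tokens, vq_loss = VectorQuantizer()(z)

Each clip thus reduces to a short sequence of tokens, over which a class-conditioned autoregressive prior can capture long-range structure far more cheaply than back-propagating through raw samples, which is SampleRNN's bottleneck.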
Inside the World's First Mainstream Album Made With AI
This article is part of New York's Future Issue, a collection of predictions about the near future as seen through the recent past. On June 21, 2017, electronic musician Holly Herndon and her husband, writer/philosopher/teacher Mat Dryhurst, welcomed a new addition to their family. "She's an inhuman child," Herndon tells me one afternoon, while seated in the offices of her record label, 4AD. Spawn is a nascent machine intelligence, or AI. Artificial intelligence is already being deployed for self-driving 18-wheelers, Netflix user-preference predictors, customer service, handwriting recognition, and cyber-security that fights hackers who use AI to create malware.