Comunità, Marco
Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls
Gramaccioni, Riccardo Fosco, Marinoni, Christian, Postolache, Emilian, Comunità, Marco, Cosmo, Luca, Reiss, Joshua D., Comminiello, Danilo
Sound designers and Foley artists usually sonorize a scene, such as from a movie or video game, by manually annotating and sonorizing each action of interest in the video. In our case, the intent is to leave full creative control to sound designers with a tool that allows them to bypass the more repetitive parts of their work, thus being able to focus on the creative aspects of sound production. We achieve this by presenting Stable-V2A, a two-stage model consisting of: an RMS-Mapper that estimates an envelope representative of the audio characteristics associated with the input video; and Stable-Foley, a diffusion model based on Stable Audio Open that generates audio semantically and temporally aligned with the target video. Temporal alignment is guaranteed by the use of the envelope as a ControlNet input, while semantic alignment is achieved through the use of sound representations chosen by the designer as cross-attention conditioning of the diffusion process. We train and test our model on Greatest Hits, a dataset commonly used to evaluate V2A models. In addition, to test our model on a case study of interest, we introduce Walking The Maps, a dataset of videos extracted from video games depicting animated characters walking in different locations. Samples and code are available on our demo page at https://ispamm.github.io/Stable-V2A.
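To make the temporal control described above more concrete, here is a minimal sketch of how a frame-wise RMS envelope can be computed from a waveform. It only illustrates the kind of signal an RMS-Mapper predicts and a ControlNet consumes; the frame and hop sizes are arbitrary assumptions, not the values used in Stable-V2A.

```python
import numpy as np

def rms_envelope(audio: np.ndarray, frame_size: int = 1024, hop_size: int = 256) -> np.ndarray:
    """Frame-wise RMS envelope of a mono waveform.

    frame_size and hop_size are illustrative choices, not Stable-V2A's settings.
    """
    n_frames = 1 + max(0, len(audio) - frame_size) // hop_size
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop_size : i * hop_size + frame_size]
        env[i] = np.sqrt(np.mean(frame ** 2))  # root-mean-square of the frame
    return env

# Example: envelope of one second of noise at 44.1 kHz
envelope = rms_envelope(np.random.randn(44100))
print(envelope.shape)  # one RMS value per hop
```

At inference time Stable-V2A estimates such an envelope from video alone; a sketch like this would presumably only be relevant for deriving training targets from reference audio.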
SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis
Comunità, Marco, Gramaccioni, Riccardo F., Postolache, Emilian, Rodolà, Emanuele, Comminiello, Danilo, Reiss, Joshua D.
Sound design involves creatively selecting, recording, and editing sound effects for various media like cinema, video games, and virtual/augmented reality. One of the most time-consuming steps when designing sound is synchronizing audio with video. In some cases, environmental recordings from video shoots are available, which can aid in the process. However, in video games and animations, no reference audio exists, requiring manual annotation of event timings from the video. We propose a system to extract repetitive action onsets from a video, which are then used - in conjunction with audio or textual embeddings - to condition a diffusion model trained to generate a new synchronized sound effects audio track. In this way, we leave complete creative control to the sound designer while removing the burden of synchronization with video. Furthermore, editing the onset track or changing the conditioning embedding requires much less effort than editing the audio track itself, simplifying the sonification process. We provide sound examples, source code, and pretrained models to facilitate reproducibility.
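As an illustration of what an onset track might look like as a conditioning signal, the sketch below rasterises a list of onset times into a binary per-frame vector. The frame rate and representation are assumptions for illustration, not necessarily those used by SyncFusion.

```python
import numpy as np

def onset_track(onset_times_s, duration_s, frame_rate_hz=100):
    """Rasterise onset timestamps (in seconds) into a binary per-frame track.

    frame_rate_hz is an illustrative choice; the actual conditioning
    resolution of SyncFusion may differ.
    """
    n_frames = int(round(duration_s * frame_rate_hz))
    track = np.zeros(n_frames, dtype=np.float32)
    for t in onset_times_s:
        idx = int(round(t * frame_rate_hz))
        if 0 <= idx < n_frames:
            track[idx] = 1.0  # mark the frame containing the onset
    return track

# Example: three footstep onsets in a 2-second clip
print(onset_track([0.25, 0.90, 1.55], duration_s=2.0).nonzero()[0])  # [ 25  90 155]
```

Editing such a track (adding, removing, or shifting ones) is what makes re-synchronization cheaper than editing the audio itself.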
Modulation Extraction for LFO-driven Audio Effects
Mitcheltree, Christopher, Steinmetz, Christian J., Comunità, Marco, Reiss, Joshua D.
Low-frequency oscillator (LFO)-driven audio effects, such as phaser, flanger, and chorus, modify an input signal using time-varying filters and delays, resulting in characteristic sweeping or widening effects. It has been shown that these effects can be modeled using neural networks when conditioned with the ground truth LFO signal. However, in most cases, the LFO signal is not accessible and measurement from the audio signal is nontrivial, hindering the modeling process. To address this, we propose a framework capable of extracting arbitrary LFO signals from processed audio across multiple digital audio effects, parameter settings, and instrument configurations. Since our system imposes no restrictions on the LFO signal shape, we demonstrate its ability to extract quasiperiodic, combined, and distorted modulation signals that are relevant to effect modeling. Furthermore, we show how coupling the extraction model with a simple processing network enables training of end-to-end black-box models of unseen analog or digital LFO-driven audio effects using only dry and wet audio pairs, overcoming the need to access the audio effect or internal LFO signal. We make our code available and provide the trained audio effect models in a real-time VST plugin.
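For context, the sketch below shows what an LFO-driven effect does in its simplest form: a sinusoidal LFO sweeps a short fractional delay that is mixed back with the dry signal, giving a toy flanger. The rate, depth, and interpolation scheme are illustrative assumptions and unrelated to the extraction framework itself.

```python
import numpy as np

def flanger(x, sr=44100, rate_hz=0.5, max_delay_ms=3.0, depth=0.7):
    """Toy flanger: a sine LFO modulates a fractional delay mixed with the dry signal."""
    n = np.arange(len(x))
    lfo = 0.5 * (1 + np.sin(2 * np.pi * rate_hz * n / sr))  # sinusoidal LFO in [0, 1]
    delay = lfo * max_delay_ms * 1e-3 * sr                   # time-varying delay in samples
    idx = n - delay
    i0 = np.clip(np.floor(idx).astype(int), 0, len(x) - 1)
    i1 = np.clip(i0 + 1, 0, len(x) - 1)
    frac = idx - np.floor(idx)
    delayed = (1 - frac) * x[i0] + frac * x[i1]              # linear-interpolated delayed signal
    return x + depth * delayed                               # wet = dry + modulated delay

# Example: apply the effect to one second of noise
y = flanger(np.random.randn(44100))
```

Recovering the `lfo` curve above from only the dry and wet signals is, in essence, the extraction problem the paper addresses.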
Modelling black-box audio effects with time-varying feature modulation
Comunità, Marco, Steinmetz, Christian J., Phan, Huy, Reiss, Joshua D.
Deep learning approaches for black-box modelling of audio effects have shown promise; however, the majority of existing work focuses on nonlinear effects with behaviour on relatively short time-scales, such as guitar amplifiers and distortion. While recurrent and convolutional architectures can theoretically be extended to capture behaviour at longer time scales, we show that simply scaling the width, depth, or dilation factor of existing architectures does not result in satisfactory performance when modelling audio effects such as fuzz and dynamic range compression. We demonstrate that our approach, based on time-varying feature modulation, more accurately captures long-range dependencies for a range of fuzz and compressor implementations across both time and frequency domain metrics.
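The abstract refers to time-varying feature modulation; one common way to realise this idea is a FiLM-style, per-channel scale and shift that evolves over time. The sketch below applies such a modulation to 1-D convolutional activations; the block size, the LSTM used to predict the modulation, and all layer sizes are assumptions for illustration, not the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class TimeVaryingFiLM(nn.Module):
    """Illustrative time-varying feature-wise modulation of conv activations.

    An LSTM (an assumed choice) predicts a per-block scale (gamma) and shift
    (beta) for each channel, so the modulation evolves over time instead of
    being static.
    """
    def __init__(self, channels=32, block_size=128):
        super().__init__()
        self.block_size = block_size
        self.lstm = nn.LSTM(channels, channels * 2, batch_first=True)

    def forward(self, x):                                         # x: (batch, channels, time)
        blocks = x.unfold(2, self.block_size, self.block_size)    # (b, c, n_blocks, block)
        summary = blocks.mean(dim=-1).transpose(1, 2)             # (b, n_blocks, c)
        gamma_beta, _ = self.lstm(summary)                        # (b, n_blocks, 2c)
        gamma, beta = gamma_beta.chunk(2, dim=-1)                 # each (b, n_blocks, c)
        gamma = gamma.transpose(1, 2).repeat_interleave(self.block_size, dim=2)
        beta = beta.transpose(1, 2).repeat_interleave(self.block_size, dim=2)
        return gamma * x[..., : gamma.shape[-1]] + beta           # modulated activations

conv = nn.Conv1d(1, 32, kernel_size=15, padding=7)
mod = TimeVaryingFiLM()
y = mod(conv(torch.randn(1, 1, 44100)))                           # modulated conv features
```

The key point is that the modulation parameters change block by block, letting a convolutional backbone track slow dynamics such as a compressor's gain trajectory.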
Neural Synthesis of Footsteps Sound Effects with Generative Adversarial Networks
Comunità, Marco, Phan, Huy, Reiss, Joshua D.
Footsteps are among the most ubiquitous sound effects in multimedia applications. There is substantial research into understanding the acoustic features and developing synthesis models for footstep sound effects. In this paper, we present a first attempt at adopting neural synthesis for this task. We implemented two GAN-based architectures and compared the results with real recordings as well as six traditional sound synthesis methods. Our architectures reached realism scores as high as recorded samples, showing encouraging results.

To this day, there has not yet been an attempt at exploring the use of neural networks for the synthesis of footstep sounds, although there is substantial literature exploring neural synthesis of broadband impulsive sounds, such as drum samples, which have some similarities to footsteps. One of the first attempts was in [15], where Donahue et al. developed WaveGAN, a generative adversarial network for unconditional audio synthesis. Another example of neural synthesis of drums is [16], where the authors used a Progressive Growing GAN. Variational autoencoders [17] and U-Nets [18] have also been used for the same task.
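To give a concrete picture of the WaveGAN-style generators mentioned above, the sketch below maps a latent vector to a short mono waveform with transposed 1-D convolutions. The layer counts, kernel sizes, and output length are illustrative assumptions and do not reproduce either of the two architectures compared in the paper.

```python
import torch
import torch.nn as nn

class ToyWaveGenerator(nn.Module):
    """WaveGAN-style generator: latent vector -> short mono waveform.

    All sizes are illustrative; the paper's GAN architectures are not
    reproduced here.
    """
    def __init__(self, latent_dim=100):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 16)  # project latent to (256 channels, 16 samples)
        self.net = nn.Sequential(
            nn.ConvTranspose1d(256, 128, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.Tanh(),                               # waveform constrained to [-1, 1]
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 256, 16)
        return self.net(x)                           # (batch, 1, 1024) samples

gen = ToyWaveGenerator()
audio = gen(torch.randn(4, 100))                     # four generated clips
print(audio.shape)                                   # torch.Size([4, 1, 1024])
```

A discriminator judging real versus generated clips would be trained adversarially against this generator; conditioning (for example on surface type) would be an additional input not shown here.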