Comunità, Marco
Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls
Gramaccioni, Riccardo Fosco, Marinoni, Christian, Postolache, Emilian, Comunità, Marco, Cosmo, Luca, Reiss, Joshua D., Comminiello, Danilo
Sound designers and Foley artists usually sonorize a scene, such as from a movie or video game, by manually annotating and sonorizing each action of interest in the video. In our case, the intent is to leave full creative control to sound designers with a tool that allows them to bypass the more repetitive parts of their work, thus being able to focus on the creative aspects of sound production. We achieve this by presenting Stable-V2A, a two-stage model consisting of: an RMS-Mapper that estimates an envelope representative of the audio characteristics associated with the input video; and Stable-Foley, a diffusion model based on Stable Audio Open that generates audio semantically and temporally aligned with the target video. Temporal alignment is guaranteed by the use of the envelope as a ControlNet input, while semantic alignment is achieved through the use of sound representations chosen by the designer as cross-attention conditioning of the diffusion process. We train and test our model on Greatest Hits, a dataset commonly used to evaluate V2A models. In addition, to test our model on a case study of interest, we introduce Walking The Maps, a dataset of videos extracted from video games depicting animated characters walking in different locations. Samples and code are available on our demo page at https://ispamm.github.io/Stable-V2A.
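To make the temporal control described above more concrete, here is a minimal sketch of how a frame-wise RMS envelope can be computed from a waveform. It only illustrates the kind of signal an RMS-Mapper predicts and a ControlNet consumes; the frame and hop sizes are arbitrary assumptions, not the values used in Stable-V2A.

```python
import numpy as np

def rms_envelope(audio: np.ndarray, frame_size: int = 1024, hop_size: int = 256) -> np.ndarray:
    """Frame-wise RMS envelope of a mono waveform.

    frame_size and hop_size are illustrative choices, not Stable-V2A's settings.
    """
    n_frames = 1 + max(0, len(audio) - frame_size) // hop_size
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop_size : i * hop_size + frame_size]
        env[i] = np.sqrt(np.mean(frame ** 2))  # root-mean-square of the frame
    return env

# Example: envelope of one second of noise at 44.1 kHz
envelope = rms_envelope(np.random.randn(44100))
print(envelope.shape)  # one RMS value per hop
```

At inference time Stable-V2A estimates such an envelope from video alone; a sketch like this would presumably only be relevant for deriving training targets from reference audio.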
SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis
Comunità, Marco, Gramaccioni, Riccardo F., Postolache, Emilian, Rodolà, Emanuele, Comminiello, Danilo, Reiss, Joshua D.
Sound design involves creatively selecting, recording, and editing sound effects for various media like cinema, video games, and virtual/augmented reality. One of the most time-consuming steps when designing sound is synchronizing audio with video. In some cases, environmental recordings from video shoots are available, which can aid in the process. However, in video games and animations, no reference audio exists, requiring manual annotation of event timings from the video. We propose a system to extract repetitive action onsets from a video, which are then used - in conjunction with audio or textual embeddings - to condition a diffusion model trained to generate a new synchronized sound effects audio track. In this way, we leave complete creative control to the sound designer while removing the burden of synchronization with video. Furthermore, editing the onset track or changing the conditioning embedding requires much less effort than editing the audio track itself, simplifying the sonification process. We provide sound examples, source code, and pretrained models to facilitate reproducibility.
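As an illustration of what an onset track might look like as a conditioning signal, the sketch below rasterises a list of onset times into a binary per-frame vector. The frame rate and representation are assumptions for illustration, not necessarily those used by SyncFusion.

```python
import numpy as np

def onset_track(onset_times_s, duration_s, frame_rate_hz=100):
    """Rasterise onset timestamps (in seconds) into a binary per-frame track.

    frame_rate_hz is an illustrative choice; the actual conditioning
    resolution of SyncFusion may differ.
    """
    n_frames = int(round(duration_s * frame_rate_hz))
    track = np.zeros(n_frames, dtype=np.float32)
    for t in onset_times_s:
        idx = int(round(t * frame_rate_hz))
        if 0 <= idx < n_frames:
            track[idx] = 1.0  # mark the frame containing the onset
    return track

# Example: three footstep onsets in a 2-second clip
print(onset_track([0.25, 0.90, 1.55], duration_s=2.0).nonzero()[0])  # [ 25  90 155]
```

Editing such a track (adding, removing, or shifting ones) is what makes re-synchronization cheaper than editing the audio itself.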
Modulation Extraction for LFO-driven Audio Effects
Mitcheltree, Christopher, Steinmetz, Christian J., Comunità, Marco, Reiss, Joshua D.
Low-frequency oscillator (LFO)-driven audio effects, such as phaser, flanger, and chorus, modify an input signal using time-varying filters and delays, resulting in characteristic sweeping or widening effects. It has been shown that these effects can be modeled using neural networks when conditioned with the ground truth LFO signal. However, in most cases, the LFO signal is not accessible and measurement from the audio signal is nontrivial, hindering the modeling process. To address this, we propose a framework capable of extracting arbitrary LFO signals from processed audio across multiple digital audio effects, parameter settings, and instrument configurations. Since our system imposes no restrictions on the LFO signal shape, we demonstrate its ability to extract quasiperiodic, combined, and distorted modulation signals that are relevant to effect modeling. Furthermore, we show how coupling the extraction model with a simple processing network enables training of end-to-end black-box models of unseen analog or digital LFO-driven audio effects using only dry and wet audio pairs, overcoming the need to access the audio effect or internal LFO signal. We make our code available and provide the trained audio effect models in a real-time VST plugin.
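For context, the sketch below shows what an LFO-driven effect does in its simplest form: a sinusoidal LFO sweeps a short fractional delay that is mixed back with the dry signal, giving a toy flanger. The rate, depth, and interpolation scheme are illustrative assumptions and unrelated to the extraction framework itself.

```python
import numpy as np

def flanger(x, sr=44100, rate_hz=0.5, max_delay_ms=3.0, depth=0.7):
    """Toy flanger: a sine LFO modulates a fractional delay mixed with the dry signal."""
    n = np.arange(len(x))
    lfo = 0.5 * (1 + np.sin(2 * np.pi * rate_hz * n / sr))  # sinusoidal LFO in [0, 1]
    delay = lfo * max_delay_ms * 1e-3 * sr                   # time-varying delay in samples
    idx = n - delay
    i0 = np.clip(np.floor(idx).astype(int), 0, len(x) - 1)
    i1 = np.clip(i0 + 1, 0, len(x) - 1)
    frac = idx - np.floor(idx)
    delayed = (1 - frac) * x[i0] + frac * x[i1]              # linear-interpolated delayed signal
    return x + depth * delayed                               # wet = dry + modulated delay

# Example: apply the effect to one second of noise
y = flanger(np.random.randn(44100))
```

Recovering the `lfo` curve above from only the dry and wet signals is, in essence, the extraction problem the paper addresses.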
Modelling black-box audio effects with time-varying feature modulation
Comunità, Marco, Steinmetz, Christian J., Phan, Huy, Reiss, Joshua D.
Deep learning approaches for black-box modelling of audio effects have shown promise; however, the majority of existing work focuses on nonlinear effects with behaviour on relatively short time-scales, such as guitar amplifiers and distortion. While recurrent and convolutional architectures can theoretically be extended to capture behaviour at longer time scales, we show that simply scaling the width, depth, or dilation factor of existing architectures does not result in satisfactory performance when modelling audio effects such as fuzz and dynamic range compression. We demonstrate that our approach, based on time-varying feature modulation, more accurately captures long-range dependencies for a range of fuzz and compressor implementations across both time and frequency domain metrics.
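The abstract refers to time-varying feature modulation; one common way to realise this idea is a FiLM-style, per-channel scale and shift that evolves over time. The sketch below applies such a modulation to 1-D convolutional activations; the block size, the LSTM used to predict the modulation, and all layer sizes are assumptions for illustration, not the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class TimeVaryingFiLM(nn.Module):
    """Illustrative time-varying feature-wise modulation of conv activations.

    An LSTM (an assumed choice) predicts a per-block scale (gamma) and shift
    (beta) for each channel, so the modulation evolves over time instead of
    being static.
    """
    def __init__(self, channels=32, block_size=128):
        super().__init__()
        self.block_size = block_size
        self.lstm = nn.LSTM(channels, channels * 2, batch_first=True)

    def forward(self, x):                                         # x: (batch, channels, time)
        blocks = x.unfold(2, self.block_size, self.block_size)    # (b, c, n_blocks, block)
        summary = blocks.mean(dim=-1).transpose(1, 2)             # (b, n_blocks, c)
        gamma_beta, _ = self.lstm(summary)                        # (b, n_blocks, 2c)
        gamma, beta = gamma_beta.chunk(2, dim=-1)                 # each (b, n_blocks, c)
        gamma = gamma.transpose(1, 2).repeat_interleave(self.block_size, dim=2)
        beta = beta.transpose(1, 2).repeat_interleave(self.block_size, dim=2)
        return gamma * x[..., : gamma.shape[-1]] + beta           # modulated activations

conv = nn.Conv1d(1, 32, kernel_size=15, padding=7)
mod = TimeVaryingFiLM()
y = mod(conv(torch.randn(1, 1, 44100)))                           # modulated conv features
```

The key point is that the modulation parameters change block by block, letting a convolutional backbone track slow dynamics such as a compressor's gain trajectory.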
Neural Synthesis of Footsteps Sound Effects with Generative Adversarial Networks
Comunità, Marco, Phan, Huy, Reiss, Joshua D.
Footsteps are among the most ubiquitous sound effects in multimedia applications. There is substantial research into understanding the acoustic features and developing synthesis models for footstep sound effects. In this paper, we present a first attempt at adopting neural synthesis for this task. We implemented two GAN-based architectures and compared the results with real recordings as well as six traditional sound synthesis methods. Our architectures reached realism scores as high as recorded samples, showing encouraging results.

To this day, there has not yet been an attempt at exploring the use of neural networks for the synthesis of footstep sounds, although there is substantial literature exploring neural synthesis of broadband impulsive sounds, such as drum samples, which have some similarities to footsteps. One of the first attempts was in [15], where Donahue et al. developed WaveGAN, a generative adversarial network for unconditional audio synthesis. Another example of neural synthesis of drums is [16], where the authors used a Progressive Growing GAN. Variational autoencoders [17] and U-Nets [18] have also been used for the same task.
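To give a concrete picture of the WaveGAN-style generators mentioned above, the sketch below maps a latent vector to a short mono waveform with transposed 1-D convolutions. The layer counts, kernel sizes, and output length are illustrative assumptions and do not reproduce either of the two architectures compared in the paper.

```python
import torch
import torch.nn as nn

class ToyWaveGenerator(nn.Module):
    """WaveGAN-style generator: latent vector -> short mono waveform.

    All sizes are illustrative; the paper's GAN architectures are not
    reproduced here.
    """
    def __init__(self, latent_dim=100):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 16)  # project latent to (256 channels, 16 samples)
        self.net = nn.Sequential(
            nn.ConvTranspose1d(256, 128, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.Tanh(),                               # waveform constrained to [-1, 1]
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 256, 16)
        return self.net(x)                           # (batch, 1, 1024) samples

gen = ToyWaveGenerator()
audio = gen(torch.randn(4, 100))                     # four generated clips
print(audio.shape)                                   # torch.Size([4, 1, 1024])
```

A discriminator judging real versus generated clips would be trained adversarially against this generator; conditioning (for example on surface type) would be an additional input not shown here.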