Cosmo, Luca
Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls
Gramaccioni, Riccardo Fosco, Marinoni, Christian, Postolache, Emilian, Comunità, Marco, Cosmo, Luca, Reiss, Joshua D., Comminiello, Danilo
Sound designers and Foley artists usually sonorize a scene, such as one from a movie or video game, by manually annotating and sonorizing each action of interest in the video. Our intent is to leave full creative control to sound designers with a tool that lets them bypass the more repetitive parts of their work, so that they can focus on the creative aspects of sound production. We achieve this by presenting Stable-V2A, a two-stage model consisting of: an RMS-Mapper, which estimates an envelope representative of the audio characteristics associated with the input video; and Stable-Foley, a diffusion model based on Stable Audio Open that generates audio semantically and temporally aligned with the target video. Temporal alignment is guaranteed by using the envelope as a ControlNet input, while semantic alignment is achieved through sound representations chosen by the designer as cross-attention conditioning of the diffusion process. We train and test our model on Greatest Hits, a dataset commonly used to evaluate V2A models. In addition, to test our model on a case study of interest, we introduce Walking The Maps, a dataset of videos extracted from video games depicting animated characters walking in different locations. Samples and code are available on our demo page at https://ispamm.github.io/Stable-V2A.
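To make the temporal control signal concrete, here is a minimal NumPy sketch of the kind of frame-wise RMS envelope an RMS-Mapper would be trained to predict from video; the frame and hop sizes and the function name are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def rms_envelope(audio: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Frame-wise root-mean-square envelope of a mono waveform (illustrative)."""
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop)
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop : i * hop + frame_len]
        env[i] = np.sqrt(np.mean(frame ** 2))
    return env

# Example: a 1-second 440 Hz tone at 16 kHz with a linear fade-in.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t) * t  # fade-in
print(rms_envelope(audio).round(3)[:5])  # envelope rises with the fade-in
```

An envelope like this, predicted per video frame, is what would then be fed to the ControlNet branch as the temporal conditioning signal.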
Naturalistic Music Decoding from EEG Data via Latent Diffusion Models
Postolache, Emilian, Polouliakh, Natalia, Kitano, Hiroaki, Connelly, Akima, Rodolà, Emanuele, Cosmo, Luca, Akama, Taketo
In this article, we explore the potential of latent diffusion models, a family of powerful generative models, for the task of reconstructing naturalistic music from electroencephalogram (EEG) recordings. Unlike simpler music with limited timbres, such as MIDI-generated tunes or monophonic pieces, the focus here is on intricate music featuring a diverse array of instruments, voices, and effects, rich in harmonics and timbre. This study represents an initial foray into achieving high-quality general music reconstruction from non-invasive EEG data, employing an end-to-end training approach directly on raw data, without manual pre-processing or channel selection. We train our models on the public NMED-T dataset and perform a quantitative evaluation, proposing neural embedding-based metrics. We additionally perform song classification based on the generated tracks. Our work contributes to the ongoing research in neural decoding and brain-computer interfaces, offering insights into the feasibility of using EEG data for complex auditory information reconstruction.
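The following toy PyTorch sketch illustrates the general shape of EEG-conditioned latent diffusion sampling, not the authors' actual architecture: a hypothetical denoiser receives a noisy audio latent, an EEG embedding, and a timestep, and a crude Euler-style loop draws a latent that a pretrained decoder would then turn into a waveform. All dimensions and names are made up for illustration.

```python
import torch
import torch.nn as nn

class EEGConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts noise in an audio latent given an EEG embedding."""
    def __init__(self, latent_dim: int = 64, eeg_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + eeg_dim + 1, 128), nn.SiLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, z_t, eeg_emb, t):
        return self.net(torch.cat([z_t, eeg_emb, t[:, None]], dim=-1))

@torch.no_grad()
def sample(model, eeg_emb, steps: int = 50, latent_dim: int = 64):
    z = torch.randn(eeg_emb.shape[0], latent_dim)
    for i in reversed(range(steps)):
        t = torch.full((eeg_emb.shape[0],), i / steps)
        eps = model(z, eeg_emb, t)
        z = z - eps / steps  # crude Euler-style update, illustration only
    return z  # would be passed to the latent decoder to obtain a waveform

model = EEGConditionedDenoiser()
latents = sample(model, torch.randn(2, 32))  # 2 fake EEG embeddings
print(latents.shape)  # torch.Size([2, 64])
```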
Graph Kernel Neural Networks
Cosmo, Luca, Minello, Giorgia, Bicciato, Alessandro, Bronstein, Michael, Rodolà, Emanuele, Rossi, Luca, Torsello, Andrea
The convolution operator at the core of many modern neural architectures can effectively be seen as performing a dot product between an input matrix and a filter. While this is readily applicable to data such as images, which can be represented as regular grids in the Euclidean space, extending the convolution operator to work on graphs proves more challenging, due to their irregular structure. In this paper, we propose to use graph kernels, i.e. kernel functions that compute an inner product on graphs, to extend the standard convolution operator to the graph domain. This allows us to define an entirely structural model that does not require computing the embedding of the input graph. Our architecture allows us to plug in any type of graph kernel and has the added benefit of providing some interpretability in terms of the structural masks learned during training, similarly to what happens with convolutional masks in traditional convolutional neural networks. We perform an extensive ablation study to investigate the impact of the model hyper-parameters and show that our model achieves competitive performance on standard graph classification and regression datasets.
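As a rough illustration of using a graph kernel as a convolution-like filter, the sketch below scores each node's egonet against a small structural mask with a simple degree-histogram kernel. The kernel choice, mask, and toy graph are illustrative stand-ins under our own assumptions, not the paper's learned masks.

```python
import numpy as np

def degree_histogram_kernel(A1, A2, max_deg: int = 8) -> float:
    """A simple graph kernel: inner product of the two degree histograms."""
    h1 = np.bincount(A1.sum(0).astype(int), minlength=max_deg + 1)[: max_deg + 1]
    h2 = np.bincount(A2.sum(0).astype(int), minlength=max_deg + 1)[: max_deg + 1]
    return float(h1 @ h2)

def egonet(A, v):
    """Subgraph induced by node v and its neighbours."""
    nodes = np.flatnonzero(A[v]).tolist() + [v]
    return A[np.ix_(nodes, nodes)]

# Filter response of every node against a small "structural mask" (here fixed,
# whereas in the paper such masks would be learned during training).
A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]], dtype=float)  # toy graph
mask = np.array([[0,1],[1,0]], dtype=float)  # a 2-node mask (a single edge)
responses = [degree_histogram_kernel(egonet(A, v), mask) for v in range(4)]
print(responses)  # per-node similarity to the mask, analogous to a feature map
```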
COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations
Ciranni, Ruben, Postolache, Emilian, Mariani, Giorgio, Mancusi, Michele, Cosmo, Luca, Rodolà, Emanuele
We present COCOLA (Coherence-Oriented Contrastive Learning for Audio), a contrastive learning method for musical audio representations that captures the harmonic and rhythmic coherence between samples. Our method operates at the level of stems (or their combinations) composing music tracks and allows the objective evaluation of compositional models for music in the task of accompaniment generation. We also introduce a new baseline for compositional music generation called CompoNet, based on ControlNet, which generalizes the tasks of MSDM, and we quantify it against the latter using COCOLA. We release all models trained on public datasets containing separate stems (MUSDB18-HQ, MoisesDB, Slakh2100, and CocoChorales).
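The coherence objective can be illustrated with a standard InfoNCE-style contrastive loss over paired stem embeddings, sketched below. This is a generic formulation under our own assumptions (embedding sizes, temperature), not COCOLA's exact loss.

```python
import torch
import torch.nn.functional as F

def coherence_contrastive_loss(emb_a, emb_b, temperature: float = 0.1):
    """InfoNCE-style loss: stems from the same track (row i of both batches)
    are positives; all other pairings in the batch act as negatives."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(emb_a.shape[0])  # diagonal entries = positives
    return F.cross_entropy(logits, targets)

# Toy usage: embeddings of 4 stem pairs (e.g., drums vs. the rest of the mix).
loss = coherence_contrastive_loss(torch.randn(4, 128), torch.randn(4, 128))
print(loss.item())
```

Once trained, the learned similarity itself can score how well a generated accompaniment fits a given track, which is what enables the objective evaluation of accompaniment generation mentioned above.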
Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models
Postolache, Emilian, Mariani, Giorgio, Cosmo, Luca, Benetos, Emmanouil, Rodolà, Emanuele
Multi-Source Diffusion Models (MSDM) allow for compositional musical generation tasks: generating a set of coherent sources, creating accompaniments, and performing source separation. Despite their versatility, they require estimating the joint distribution over the sources, necessitating pre-separated musical data, which is rarely available, and fixing the number and type of sources at training time. This paper generalizes MSDM to arbitrary time-domain diffusion models conditioned on text embeddings. These models do not require separated data as they are trained on mixtures, can parameterize an arbitrary number of sources, and allow for rich semantic control. We propose an inference procedure enabling the coherent generation of sources and accompaniments. Additionally, we adapt the Dirac separator of MSDM to perform source separation. We experiment with diffusion models trained on Slakh2100 and MTG-Jamendo, showcasing competitive generation and separation results in a relaxed data setting.
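One ingredient of such an inference procedure is keeping the generated sources consistent with an observed mixture. The NumPy sketch below shows a Dirac-likelihood-style projection step that shifts the source estimates equally so they sum exactly to the mixture; it is a schematic illustration of the constraint, not the paper's full sampler.

```python
import numpy as np

def project_to_mixture(sources, mixture):
    """Shift each source estimate by an equal share of the residual so that
    the estimates sum exactly to the observed mixture (hard constraint)."""
    residual = mixture - sources.sum(axis=0)
    return sources + residual / sources.shape[0]

# Toy usage: 3 source estimates of a length-5 signal vs. an observed mixture.
est = np.random.randn(3, 5)
mix = np.random.randn(5)
proj = project_to_mixture(est, mix)
print(np.allclose(proj.sum(axis=0), mix))  # True
```

In a diffusion sampler, a projection of this kind would be interleaved with the denoising updates, so the final sources both look like the model's prior and add up to the given mixture.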
Graph Generation via Spectral Diffusion
Minello, Giorgia, Bicciato, Alessandro, Rossi, Luca, Torsello, Andrea, Cosmo, Luca
In this paper, we present GRASP, a novel graph generative model based on 1) the spectral decomposition of the graph Laplacian matrix and 2) a diffusion process. Specifically, we propose to use a denoising model to sample eigenvectors and eigenvalues from which we can reconstruct the graph Laplacian and adjacency matrix. Our permutation-invariant model can also handle node features by concatenating them to the eigenvectors of each node. Using the Laplacian spectrum allows us to naturally capture the structural characteristics of the graph and work directly in the node space.
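A minimal NumPy sketch of the reconstruction step described above: rebuilding the Laplacian from eigenvectors and eigenvalues (as a denoising model would sample them) and recovering a binary adjacency matrix. The thresholding scheme is an illustrative assumption.

```python
import numpy as np

def laplacian_from_spectrum(eigvecs, eigvals):
    """Rebuild L = U diag(lambda) U^T from a (possibly truncated) spectrum."""
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

def adjacency_from_laplacian(L, threshold: float = 0.5):
    """Recover a binary adjacency matrix: A = D - L, then threshold."""
    A = np.diag(np.diag(L)) - L
    A = (A + A.T) / 2  # enforce symmetry against numerical noise
    return (A > threshold).astype(float)

# Round-trip check on a toy 4-node path graph.
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
L = np.diag(A.sum(0)) - A
w, U = np.linalg.eigh(L)
A_rec = adjacency_from_laplacian(laplacian_from_spectrum(U, w))
print(np.allclose(A, A_rec))  # True
```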
GNN-LoFI: a Novel Graph Neural Network through Localized Feature-based Histogram Intersection
Bicciato, Alessandro, Cosmo, Luca, Minello, Giorgia, Rossi, Luca, Torsello, Andrea
Graph neural networks are increasingly becoming the framework of choice for graph-based machine learning. In this paper, we propose a new graph neural network architecture that substitutes classical message passing with an analysis of the local distribution of node features. To this end, we extract the distribution of features in the egonet for each local neighbourhood and compare them against a set of learned label distributions by taking the histogram intersection kernel. The similarity information is then propagated to other nodes in the network, effectively creating a message passing-like mechanism where the message is determined by the ensemble of the features. We perform an ablation study to evaluate the network's performance under different choices of its hyper-parameters. Finally, we test our model on standard graph classification and regression benchmarks, and we find that it outperforms widely used alternative approaches, including both graph kernels and graph neural networks.
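The core comparison can be sketched in a few lines: compute the label distribution in each node's egonet and score it against a learned distribution with the histogram intersection kernel. The toy graph and the "learned" distribution below are fixed illustrative stand-ins; in the model these distributions would be trained parameters.

```python
import numpy as np

def histogram_intersection(h1, h2) -> float:
    """Histogram intersection kernel: sum of element-wise minima."""
    return float(np.minimum(h1, h2).sum())

def egonet_label_histogram(A, labels, v, n_labels):
    """Normalized distribution of discrete node labels in the egonet of v."""
    nodes = np.flatnonzero(A[v]).tolist() + [v]
    hist = np.bincount(labels[nodes], minlength=n_labels).astype(float)
    return hist / hist.sum()

# Similarity of every node's local label distribution to a reference one.
A = np.array([[0,1,1],[1,0,0],[1,0,0]], dtype=float)  # toy 3-node star
labels = np.array([0, 1, 1])
learned = np.array([0.5, 0.5])  # stand-in for a learned label distribution
sims = [histogram_intersection(egonet_label_histogram(A, labels, v, 2), learned)
        for v in range(3)]
print(np.round(sims, 2))  # per-node similarities, then propagated as "messages"
```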
Multi-Source Diffusion Models for Simultaneous Music Generation and Separation
Mariani, Giorgio, Tallini, Irene, Postolache, Emilian, Mancusi, Michele, Cosmo, Luca, Rodolà, Emanuele
In this work, we define a diffusion-based generative model capable of both music synthesis and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e., generating a mixture, separating the sources), we also introduce and experiment on the partial generation task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task based on Dirac likelihood functions. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the source separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.
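Source imputation can be illustrated with an inpainting-style sampling loop that keeps the given sources fixed (re-noised to the current noise level) and samples only the missing ones. The PyTorch sketch below is schematic, with a stand-in denoiser and a crude Euler update, not the paper's sampler.

```python
import torch

@torch.no_grad()
def impute_sources(denoise_fn, known, known_mask, steps: int = 50):
    """Partial generation (source imputation): hold the provided sources fixed
    and sample only the missing ones, in the spirit of diffusion inpainting.
    `known` is (n_sources, length); `known_mask` flags the provided sources."""
    x = torch.randn_like(known)
    for i in reversed(range(steps)):
        t = (i + 1) / steps
        noisy_known = known + torch.randn_like(known) * t  # re-noise the givens
        x = torch.where(known_mask, noisy_known, x)
        x = x + denoise_fn(x, t) / steps  # crude Euler-style update
    return torch.where(known_mask, known, x)

# Toy usage with a stand-in denoiser (a real model would predict the score).
denoise_fn = lambda x, t: -x * t
known = torch.zeros(4, 100)
mask = torch.zeros(4, 100, dtype=torch.bool)
mask[0] = True  # e.g., the drum track is given, generate the other 3 sources
print(impute_sources(denoise_fn, known, mask).shape)  # torch.Size([4, 100])
```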
Spectral Maps for Learning on Subgraphs
Pegoraro, Marco, Marin, Riccardo, Rampini, Arianna, Melzi, Simone, Cosmo, Luca, Rodolà, Emanuele
In graph learning, maps between graphs and their subgraphs frequently arise. For instance, when coarsening or rewiring operations are present along the pipeline, one needs to keep track of the corresponding nodes between the original and modified graphs. Classically, these maps are represented as binary node-to-node correspondence matrices and used as-is to transfer node-wise features between the graphs. In this paper, we argue that simply changing this map representation can bring notable benefits to graph learning tasks. Drawing inspiration from recent progress in geometry processing, we introduce a spectral representation for maps that is easy to integrate into existing graph learning models. This spectral representation is a compact and straightforward plug-in replacement and is robust to topological changes of the graphs. Remarkably, the representation exhibits structural properties that make it interpretable, drawing an analogy with recent results on smooth manifolds. We demonstrate the benefits of incorporating spectral maps in graph learning pipelines, addressing scenarios where a node-to-node map is not well defined, or in the absence of exact isomorphism. Our approach bears practical benefits in knowledge distillation and hierarchical learning, where we show comparable or improved performance at a fraction of the computational cost.
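A functional-map-style spectral representation of a node correspondence can be sketched by projecting the binary correspondence matrix onto truncated Laplacian eigenbases of the two graphs. The example below, a 5-node path graph and the subgraph formed by its first three nodes, is our own toy illustration, not the paper's pipeline.

```python
import numpy as np

def spectral_map(Phi_g, Phi_h, P, k: int = 4):
    """Spectral representation of a node correspondence P from graph G to
    subgraph H: C = pinv(Phi_h) @ P @ Phi_g, truncated to the first k modes."""
    return np.linalg.pinv(Phi_h[:, :k]) @ P @ Phi_g[:, :k]

# Toy example: H = first 3 nodes of a 5-node path graph G.
A = np.diag(np.ones(4), 1)
A = A + A.T                                   # path graph on 5 nodes
L = np.diag(A.sum(0)) - A
_, Phi_g = np.linalg.eigh(L)                  # Laplacian eigenvectors of G
Ah = A[:3, :3]
Lh = np.diag(Ah.sum(0)) - Ah
_, Phi_h = np.linalg.eigh(Lh)                 # Laplacian eigenvectors of H
P = np.eye(3, 5)                              # binary node-to-node inclusion map
C = spectral_map(Phi_g, Phi_h, P, k=3)
print(C.round(2))                             # compact k x k spectral map
```

The payoff is that C is a small dense matrix whose size is set by the number of retained eigenvectors rather than by the number of nodes, which is what makes it cheap to store and easy to plug into a learning pipeline.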
Latent Autoregressive Source Separation
Postolache, Emilian, Mariani, Giorgio, Mancusi, Michele, Santilli, Andrea, Cosmo, Luca, Rodolà, Emanuele
Autoregressive models have achieved impressive results over a wide range of domains in terms of generation quality and downstream task performance. In the continuous domain, a key factor behind this success is the usage of quantized latent spaces (e.g., obtained via VQ-VAE autoencoders), which allow for dimensionality reduction and faster inference times. However, using existing pre-trained models to perform new non-trivial tasks is difficult since it requires additional fine-tuning or extensive training to elicit prompting. This paper introduces LASS as a way to perform vector-quantized Latent Autoregressive Source Separation (i.e., de-mixing an input signal into its constituent sources) without requiring additional gradient-based optimization or modifications of existing models. Our separation method relies on the Bayesian formulation in which the autoregressive models are the priors, and a discrete (non-parametric) likelihood function is constructed by performing frequency counts over latent sums of addend tokens. We test our method on images and audio with several sampling strategies (e.g., ancestral, beam search) showing competitive results with existing approaches in terms of separation quality while offering at the same time significant speedups in terms of inference time and scalability to higher dimensional data.
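The discrete likelihood can be illustrated by counting, over a paired corpus, which addend-token pairs co-occur with each mixture token, then combining the resulting table with the autoregressive priors at separation time. The sketch below uses a toy modular "sum" in place of real VQ latents; the codebook size and all names are illustrative assumptions.

```python
import numpy as np

# Build a discrete (non-parametric) likelihood by counting which pairs of
# addend tokens (a, b) co-occur with each mixture token m in a paired corpus.
V = 8                                       # toy codebook size
rng = np.random.default_rng(0)
a_toks = rng.integers(0, V, 10000)
b_toks = rng.integers(0, V, 10000)
m_toks = (a_toks + b_toks) % V              # stand-in for "token of the latent sum"

counts = np.zeros((V, V, V))
np.add.at(counts, (m_toks, a_toks, b_toks), 1)
likelihood = counts / counts.sum(axis=(1, 2), keepdims=True)  # P(a, b | m)

def separate_token(m, prior_a, prior_b):
    """Posterior over (a, b) given mixture token m: likelihood times priors.
    In LASS the priors would come from pre-trained autoregressive models."""
    post = likelihood[m] * prior_a[:, None] * prior_b[None, :]
    return np.unravel_index(post.argmax(), post.shape)

uniform = np.full(V, 1 / V)
print(separate_token(3, uniform, uniform))  # most likely addend pair for m = 3
```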