audio effect
Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
Yu, Chin-Yun, Martínez-Ramírez, Marco A., Koo, Junghyun, Liao, Wei-Hsiang, Mitsufuji, Yuki, Fazekas, György
Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to an audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which can result in unrealistic configurations or biased outcomes. We address this pitfall by introducing a Gaussian prior derived from the DiffVox vocal preset dataset over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations on vocal effects transfer on the MedleyDB dataset show significant improvements across metrics compared to baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The proposed calibration reduces the parameter mean squared error by up to 33% and more closely matches the reference style. Subjective evaluations with 16 participants confirm the superiority of our method in limited data regimes. This work demonstrates how incorporating prior knowledge at inference time enhances audio effects transfer, paving the way for more effective and realistic audio processing systems.
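A minimal sketch of the MAP idea under stated assumptions: `render`, `style_embed`, and the prior statistics below are toy stand-ins, not the paper's components; in the real system the prior would be a Gaussian fit to DiffVox vocal presets and the data term would use the ST-ITO style encoder.

```python
# Hedged sketch: calibrating inference-time optimisation with a Gaussian
# prior, i.e. minimising a MAP objective. All components here are toys.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
D = 4                                  # toy number of effect parameters
mu = np.zeros(D)                       # prior mean (fit to presets, e.g. DiffVox)
Sigma_inv = np.eye(D)                  # inverse prior covariance

P = rng.standard_normal((8, 256)) / 16.0

def render(audio, theta):
    # Placeholder "effects chain": only the parameter sum is audible, so the
    # embedding term alone cannot identify theta -- the prior breaks the tie.
    return audio * (1.0 + 0.1 * theta.sum())

def style_embed(audio):
    # Placeholder "style encoder": a fixed random projection.
    return P @ audio

audio = rng.standard_normal(256)
ref_emb = style_embed(render(audio, np.array([0.5, -0.2, 0.1, 0.0])))

def neg_log_posterior(theta, lam=0.1):
    emb = style_embed(render(audio, theta))
    data_term = np.sum((emb - ref_emb) ** 2)               # embedding distance
    prior_term = (theta - mu) @ Sigma_inv @ (theta - mu)   # Gaussian prior
    return data_term + lam * prior_term                    # MAP objective

theta_hat = minimize(neg_log_posterior, x0=mu, method="Nelder-Mead").x
print("MAP estimate:", np.round(theta_hat, 3))
```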
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Media > Music (0.46)
- Leisure & Entertainment (0.46)
Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?
Deng, Qixin, Pardo, Bryan, Pappas, Thrasyvoulos N.
Understanding and modeling the relationship between language and sound is critical for applications such as music information retrieval, text-guided music generation, and audio captioning. Central to these tasks is the use of joint language-audio embedding spaces, which map textual descriptions and auditory content into a shared embedding space. While multimodal embedding models such as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning language and audio, their correspondence to human perception of timbre, a multifaceted attribute encompassing qualities such as brightness, roughness, and warmth, remains underexplored. In this paper, we evaluate the above three joint language-audio embedding models on their ability to capture perceptual dimensions of timbre. Our findings show that LAION-CLAP consistently provides the most reliable alignment with human-perceived timbre semantics across both instrumental sounds and audio effects.
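To make the evaluation recipe concrete, here is a hedged sketch of one way such alignment can be probed: `embed_text` and `embed_audio` are random placeholders where a real model (e.g. LAION-CLAP) would go, and the listener ratings are simulated.

```python
# Hedged sketch: probing a joint language-audio space for timbre semantics.
# `embed_text`/`embed_audio` stand in for a real model such as LAION-CLAP.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
dim = 512

def embed_text(prompt):                 # placeholder text encoder
    return rng.standard_normal(dim)

def embed_audio(clip):                  # placeholder audio encoder
    return rng.standard_normal(dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

clips = [rng.standard_normal(2048) for _ in range(20)]
human_brightness = rng.uniform(0, 1, size=20)   # would be listener ratings

t = embed_text("a bright sound")
model_scores = [cosine(embed_audio(c), t) for c in clips]

# Alignment metric: rank correlation between model similarity and ratings.
rho, p = spearmanr(model_scores, human_brightness)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
```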
- Media > Music (0.70)
- Leisure & Entertainment (0.70)
Unsupervised Estimation of Nonlinear Audio Effects: Comparing Diffusion-Based and Adversarial Approaches
Moliner, Eloi, Švento, Michal, Wright, Alec, Juvela, Lauri, Rajmic, Pavel, Välimäki, Vesa
Accurately estimating nonlinear audio effects without access to paired input-output signals remains a challenging problem. This work studies unsupervised probabilistic approaches for solving this task. We introduce a method, novel for this application, based on diffusion generative models for blind system identification, enabling the estimation of unknown nonlinear effects using black- and gray-box models. This study compares this method with a previously proposed adversarial approach, analyzing the performance of both methods under different parameterizations of the effect operator and varying lengths of available effected recordings. Through experiments on guitar distortion effects, we show that the diffusion-based approach provides more stable results and is less sensitive to data availability, while the adversarial approach is superior at estimating more pronounced distortion effects. Our findings contribute to the robust unsupervised blind estimation of audio effects, demonstrating the potential of diffusion models for system identification in music technology.
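As a rough illustration of what a parameterized effect operator can look like, here is a toy gray-box model in the filter-nonlinearity-filter mould; it is a generic sketch of the kind of operator a blind estimator would fit, not the paper's model.

```python
# Hedged sketch: a gray-box nonlinear effect operator
# (FIR filter -> static waveshaper -> FIR filter). Not the paper's model.
import torch
import torch.nn as nn

class GrayBoxDistortion(nn.Module):
    def __init__(self, taps=16):
        super().__init__()
        self.pre = nn.Conv1d(1, 1, taps, padding=taps - 1)   # pre-emphasis filter
        self.gain = nn.Parameter(torch.tensor(2.0))          # drive into nonlinearity
        self.post = nn.Conv1d(1, 1, taps, padding=taps - 1)  # tone-shaping filter

    def forward(self, x):                 # x: (batch, 1, time)
        n = x.shape[-1]
        y = self.pre(x)[..., :n]          # crop back to a causal length
        y = torch.tanh(self.gain * y)     # static nonlinearity
        return self.post(y)[..., :n]

x = torch.randn(1, 1, 4096)
print(GrayBoxDistortion()(x).shape)       # torch.Size([1, 1, 4096])
```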
- Europe > Italy > Marche > Ancona Province > Ancona (0.05)
- Europe > Czechia > South Moravian Region > Brno (0.05)
- South America > Suriname > North Atlantic Ocean (0.04)
- Media > Music (0.34)
- Leisure & Entertainment (0.34)
A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration
Selvamani, Shaja Arul, Ganapathy, Nia D'Souza
This research introduces an innovative AI-driven multi-agent framework specifically designed for creating immersive audiobooks. Leveraging neural text-to-speech synthesis with FastSpeech 2 and VALL-E for expressive narration and character-specific voices, the framework employs advanced language models to automatically interpret textual narratives and generate realistic spatial audio effects. These sound effects are dynamically synchronized with the storyline through sophisticated temporal integration methods, including Dynamic Time Warping (DTW) and recurrent neural networks (RNNs). Diffusion-based generative models combined with higher-order ambisonics (HOA) and scattering delay networks (SDN) enable highly realistic 3D soundscapes, substantially enhancing listener immersion and narrative realism. This technology significantly advances audiobook applications, providing richer experiences for educational content, storytelling platforms, and accessibility solutions for visually impaired audiences. Future work will address personalization, ethical management of synthesized voices, and integration with multi-sensory platforms.
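The temporal-integration step can be illustrated with a plain DTW implementation; the 1-D feature sequences below are toy stand-ins for whatever narration and effect features the framework actually aligns.

```python
# Hedged sketch: classic DTW, the alignment step used to sync generated
# sound effects to narration. Toy 1-D features, not the paper's.
import numpy as np

def dtw(a, b):
    """Return the DTW cost matrix for two 1-D feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[1:, 1:]

narration = np.sin(np.linspace(0, 4, 50))     # e.g. an energy envelope
effect = np.sin(np.linspace(0.5, 4.5, 40))    # effect cue, offset in time
print("alignment cost:", dtw(narration, effect)[-1, -1].round(3))
```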
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > India > Maharashtra > Mumbai (0.04)
- Health & Medicine (1.00)
- Media > Publishing (0.65)
- Information Technology > Security & Privacy (0.49)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Investigating the Sensitivity of Pre-trained Audio Embeddings to Common Effects
Deng, Victor, Wang, Changhong, Richard, Gael, McFee, Brian
In recent years, foundation models have significantly advanced data-driven systems across various domains. Yet, their underlying properties, especially when functioning as feature extractors, remain under-explored. In this paper, we investigate the sensitivity of audio embeddings extracted from widely-used foundation models, including OpenL3, PANNs, and CLAP, to audio effects. We focus on audio effects as the source of sensitivity due to their prevalent presence in large audio datasets. By applying parameterized audio effects (gain, low-pass filtering, reverberation, and bitcrushing), we analyze the correlation between the deformation trajectories and the effect strength in the embedding space. We propose to quantify the dimensionality and linearizability of the deformation trajectories induced by audio effects using canonical correlation analysis. We find that there exists a direction along which the embeddings move monotonically as the audio effect strength increases, but that the subspace containing the displacements is generally high-dimensional. This shows that pre-trained audio embeddings do not globally linearize the effects. Our empirical results on downstream instrument classification tasks confirm that projecting out the estimated deformation directions cannot generally improve the robustness of pre-trained embeddings to audio effects.
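A hedged sketch of the trajectory analysis: a random nonlinear map stands in for a pre-trained embedder, gain plays the role of the parameterized effect, and scikit-learn's CCA measures how well a single canonical direction tracks effect strength.

```python
# Hedged sketch: does a single direction in embedding space track effect
# strength? Toy embedder (random tanh map) and toy effect (gain).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(2)
W = rng.standard_normal((16, 1024)) / 32.0   # placeholder "embedding" map

def embed(audio):
    return np.tanh(W @ audio)

clean = rng.standard_normal(1024)
strengths = np.linspace(0.0, 1.0, 40)
embs = np.stack([embed(clean * (1.0 + 3.0 * s)) for s in strengths])
disp = embs - embs[0]                        # deformation trajectory

# One canonical pair between displacements and effect strength:
cca = CCA(n_components=1).fit(disp, strengths.reshape(-1, 1))
u, v = cca.transform(disp, strengths.reshape(-1, 1))
r = np.corrcoef(u[:, 0], v[:, 0])[0, 1]
print(f"canonical correlation with strength: {r:.3f}")
```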
- Europe > France > Île-de-France > Paris > Paris (0.04)
- North America > United States > New York (0.04)
Open-Amp: Synthetic Data Framework for Audio Effect Foundation Models
Wright, Alec, Carson, Alistair, Juvela, Lauri
This paper introduces Open-Amp, a synthetic data framework for generating large-scale and diverse audio effects data. Audio effects are relevant to many musical audio processing and Music Information Retrieval (MIR) tasks, such as modelling of analog audio effects, automatic mixing, tone matching and transcription. Existing audio effects datasets are limited in scope, usually including relatively few audio effects processors and a limited amount of input audio signals. Our proposed framework overcomes these issues by crowdsourcing neural network emulations of guitar amplifiers and effects created by users of open-source audio effects emulation software. This gives users of Open-Amp complete control over the input signals to be processed by the effects models, as well as high-quality emulations of hundreds of devices. Open-Amp can render audio online during training, allowing great flexibility in data augmentation. Our experiments show that using Open-Amp to train a guitar effects encoder achieves new state-of-the-art results on multiple guitar effects classification tasks. Furthermore, we train a one-to-many guitar effects model using Open-Amp, and use it to emulate unseen analog effects via manipulation of its learned latent space, indicating transferability to analog guitar effects data.
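The online-rendering idea can be sketched as a dataset that picks a random effect and renders the wet signal on the fly; the toy waveshapers below are placeholders for Open-Amp's crowdsourced neural emulations.

```python
# Hedged sketch: an "apply a random effect on the fly" dataset, in the
# spirit of Open-Amp's online rendering. The effects are toy waveshapers.
import torch
from torch.utils.data import Dataset, DataLoader

TOY_EFFECTS = [
    lambda x, g: torch.tanh(g * x),              # soft clip
    lambda x, g: torch.clamp(g * x, -0.5, 0.5),  # hard clip
]

class OnlineFxDataset(Dataset):
    def __init__(self, clips):
        self.clips = clips                      # dry input signals

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        dry = self.clips[idx]
        fx_id = torch.randint(len(TOY_EFFECTS), ()).item()
        gain = 1.0 + 4.0 * torch.rand(()).item()
        wet = TOY_EFFECTS[fx_id](dry, gain)     # rendered during training
        return dry, wet, fx_id                  # fx_id = classification label

loader = DataLoader(OnlineFxDataset([torch.randn(4096) for _ in range(8)]),
                    batch_size=4)
dry, wet, labels = next(iter(loader))
print(dry.shape, wet.shape, labels)
```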
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom > England > Surrey > Guildford (0.05)
- Europe > Denmark > Capital Region > Copenhagen (0.05)
- Research Report (0.50)
- Instructional Material (0.34)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
Modeling Analog Dynamic Range Compressors using Deep Learning and State-space Models
Yin, Hanzhi, Cheng, Gang, Steinmetz, Christian J., Yuan, Ruibin, Stern, Richard M., Dannenberg, Roger B.
We describe a novel approach for developing realistic digital models of dynamic range compressors for audio production by analyzing their analog prototypes. While realistic digital dynamic range compressors are potentially useful for many applications, the design process is challenging because the compressors operate nonlinearly over long time scales. Our approach is based on the structured state space sequence model (S4), as the state-space model (SSM) has proven efficient at learning long-range dependencies, making it promising for modeling dynamic range compressors. In this paper, we present a deep learning model with S4 layers to model the Teletronix LA-2A analog dynamic range compressor. The model is causal, executes efficiently in real time, and achieves roughly the same quality as previous deep-learning models but with fewer parameters.
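For intuition, here is a naive diagonal state-space recurrence; S4 layers are a fast, carefully parameterized version of this building block, so the sketch below illustrates the recurrence only, not S4 itself.

```python
# Hedged sketch: a toy diagonal state-space recurrence. This naive scan
# shows the dynamics S4-style layers compute efficiently; it is not S4.
import torch
import torch.nn as nn

class DiagonalSSM(nn.Module):
    def __init__(self, state_dim=16):
        super().__init__()
        self.log_a = nn.Parameter(torch.full((state_dim,), -0.1))  # decay rates
        self.b = nn.Parameter(torch.randn(state_dim) * 0.1)
        self.c = nn.Parameter(torch.randn(state_dim) * 0.1)
        self.d = nn.Parameter(torch.zeros(()))

    def forward(self, u):                     # u: (batch, time)
        a = torch.exp(self.log_a)             # |a| < 1 keeps the system stable
        x = torch.zeros(u.shape[0], self.b.shape[0], device=u.device)
        ys = []
        for t in range(u.shape[1]):           # x[t+1] = a * x[t] + b * u[t]
            x = a * x + self.b * u[:, t:t + 1]
            ys.append(x @ self.c + self.d * u[:, t])
        return torch.stack(ys, dim=1)         # y[t] = c^T x[t] + d * u[t]

y = DiagonalSSM()(torch.randn(2, 256))
print(y.shape)                                # torch.Size([2, 256])
```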
- Europe > Austria > Vienna (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Spain > Andalusia > Málaga Province > Málaga (0.04)
- Europe > Denmark > North Jutland > Aalborg (0.04)
- Media (0.68)
- Leisure & Entertainment (0.46)
Style Transfer for Non-differentiable Audio Effects
Digital audio effects are widely used by audio engineers to alter the acoustic and temporal qualities of audio data. However, these effects can have a large number of parameters, which can make them difficult for beginners to learn and can hamper creativity for professionals. Recently, there have been a number of efforts to employ progress in deep learning to acquire the low-level parameter configurations of audio effects by minimising an objective function between an input and reference track, commonly referred to as style transfer. However, current approaches use inflexible black-box techniques or require that the effects under consideration be implemented in an auto-differentiation framework. In this work, we propose a deep learning approach to audio production style matching which can be used with effects implemented in some of the most widely used frameworks, requiring only that the parameters under consideration have a continuous domain. Further, our method includes style matching for various classes of effects, many of which are difficult or impossible to approximate closely using differentiable functions. We show that our audio embedding approach creates logical encodings of timbral information, which can be used for a number of downstream tasks. Further, we perform a listening test which demonstrates that our approach is able to convincingly style match a multi-band compressor effect.
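A hedged sketch of the setting: the "plugin" below is non-differentiable (it hard-quantizes), its parameters are continuous, and a gradient-free optimizer matches a reference style in a placeholder embedding space. The processor, embedder, and optimizer choice are all illustrative assumptions, not the paper's method.

```python
# Hedged sketch: gradient-free style matching for a non-differentiable
# effect with continuous parameters. All components here are toys.
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(3)
P = rng.standard_normal((32, 2048)) / 64.0      # placeholder audio embedder

def plugin(audio, drive, depth):
    """Non-differentiable effect: hard quantisation after drive."""
    step = 2.0 ** -np.clip(depth, 1, 12)
    return np.round(np.tanh(drive * audio) / step) * step

def embed(audio):
    return P @ audio

dry = rng.standard_normal(2048)
ref = embed(plugin(dry, drive=3.0, depth=4.0))  # target style

def loss(params):
    return np.sum((embed(plugin(dry, *params)) - ref) ** 2)

res = differential_evolution(loss, bounds=[(0.5, 8.0), (1.0, 12.0)],
                             seed=0, maxiter=30, tol=1e-6)
print("recovered (drive, depth):", np.round(res.x, 2))
```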
- Media > Music (0.46)
- Leisure & Entertainment (0.46)
Modulation Extraction for LFO-driven Audio Effects
Mitcheltree, Christopher, Steinmetz, Christian J., Comunità, Marco, Reiss, Joshua D.
Low frequency oscillator (LFO) driven audio effects such as phaser, flanger, and chorus, modify an input signal using time-varying filters and delays, resulting in characteristic sweeping or widening effects. It has been shown that these effects can be modeled using neural networks when conditioned with the ground truth LFO signal. However, in most cases, the LFO signal is not accessible and measurement from the audio signal is nontrivial, hindering the modeling process. To address this, we propose a framework capable of extracting arbitrary LFO signals from processed audio across multiple digital audio effects, parameter settings, and instrument configurations. Since our system imposes no restrictions on the LFO signal shape, we demonstrate its ability to extract quasiperiodic, combined, and distorted modulation signals that are relevant to effect modeling. Furthermore, we show how coupling the extraction model with a simple processing network enables training of end-to-end black-box models of unseen analog or digital LFO-driven audio effects using only dry and wet audio pairs, overcoming the need to access the audio effect or internal LFO signal. We make our code available and provide the trained audio effect models in a real-time VST plugin.
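To make the modeling target concrete, this toy example synthesizes a flanger-like effect from a known LFO via a time-varying fractional delay; extraction would run in the opposite direction, recovering such an LFO from the wet audio alone.

```python
# Hedged sketch: a toy flanger driven by a known LFO, i.e. the kind of
# modulation signal an extraction model would recover from wet audio.
import numpy as np

sr = 16000
t = np.arange(sr) / sr
dry = np.sin(2 * np.pi * 220 * t)               # toy input signal

lfo = 0.5 * (1 + np.sin(2 * np.pi * 0.5 * t))   # 0..1, 0.5 Hz modulator
delay = (1 + 4 * lfo) * sr / 1000.0             # 1..5 ms time-varying delay

idx = np.arange(len(dry)) - delay               # fractional read positions
idx = np.clip(idx, 0, len(dry) - 1)
delayed = np.interp(idx, np.arange(len(dry)), dry)  # linear interpolation
wet = 0.5 * dry + 0.5 * delayed                 # classic flanger mix

print(wet.shape, f"delay range: {delay.min():.1f}..{delay.max():.1f} samples")
```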
- Europe > Denmark > Capital Region > Copenhagen (0.05)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Asia > Middle East > Iran (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
Modelling black-box audio effects with time-varying feature modulation
Comunità, Marco, Steinmetz, Christian J., Phan, Huy, Reiss, Joshua D.
Deep learning approaches for black-box modelling of audio effects have shown promise; however, the majority of existing work focuses on nonlinear effects with behaviour on relatively short time-scales, such as guitar amplifiers and distortion. While recurrent and convolutional architectures can theoretically be extended to capture behaviour at longer time scales, we show that simply scaling the width, depth, or dilation factor of existing architectures does not result in satisfactory performance when modelling audio effects such as fuzz and dynamic range compression. We demonstrate that our approach more accurately captures long-range dependencies for a range of fuzz and compressor implementations across both time and frequency domain metrics.
[Figure 1 caption: State-of-the-art black-box models like GCN-3 [19] (grey) fail to capture the behaviour of effects with large time constants such as fuzz (blue); the proposed approach GCNTF-3 (orange), which ...]
[From the introduction: Audio effects are tools employed by audio engineers and musicians, central to shaping the timbre, dynamics, and spatialisation of sound [1]. Distortion effects such as fuzz can also pose an additional challenge since they exhibit time-varying behaviour: fuzz is characterised not only by asymmetrical clipping, which for sinusoidal inputs results in a rectangular wave output, but also by its attack and release time constants, which modulate the behaviour ...]
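As a rough sketch of time-varying feature modulation (not the paper's exact architecture), the block below modulates convolutional activations with per-block scales and shifts predicted by a small recurrent network.

```python
# Hedged sketch: time-varying feature-wise modulation of conv activations,
# the general idea behind a temporal-FiLM-conditioned TCN. Sizes are toy.
import torch
import torch.nn as nn

class TemporalFiLMBlock(nn.Module):
    def __init__(self, channels=8, block=64):
        super().__init__()
        self.block = block
        self.conv = nn.Conv1d(channels, channels, 3, padding=4, dilation=2)
        self.rnn = nn.GRU(channels, 2 * channels, batch_first=True)

    def forward(self, x):                 # x: (batch, channels, time)
        n = x.shape[-1]                   # time must be divisible by `block`
        h = self.conv(x)[..., :n]         # crop back to a causal length
        # Pool each block of samples, then let an RNN predict a scale and
        # shift per block, so the modulation evolves over time.
        pooled = h.reshape(h.shape[0], h.shape[1],
                           n // self.block, self.block).mean(-1)
        gb, _ = self.rnn(pooled.transpose(1, 2))           # (batch, blocks, 2C)
        gamma, beta = gb.transpose(1, 2).chunk(2, dim=1)   # 2 x (batch, C, blocks)
        gamma = gamma.repeat_interleave(self.block, dim=-1)
        beta = beta.repeat_interleave(self.block, dim=-1)
        return torch.relu(gamma * h + beta)

y = TemporalFiLMBlock()(torch.randn(2, 8, 512))
print(y.shape)                            # torch.Size([2, 8, 512])
```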
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)