McFee, Brian
Investigating the Sensitivity of Pre-trained Audio Embeddings to Common Effects
Deng, Victor, Wang, Changhong, Richard, Gaël, McFee, Brian
In recent years, foundation models have significantly advanced data-driven systems across various domains. Yet, their underlying properties, especially when functioning as feature extractors, remain under-explored. In this paper, we investigate the sensitivity to audio effects of audio embeddings extracted from widely-used foundation models, including OpenL3, PANNs, and CLAP. We focus on audio effects as the source of sensitivity due to their prevalence in large audio datasets. By applying parameterized audio effects (gain, low-pass filtering, reverberation, and bitcrushing), we analyze the correlation between the deformation trajectories and the effect strength in the embedding space. We propose to quantify the dimensionality and linearizability of the deformation trajectories induced by audio effects using canonical correlation analysis. We find that there exists a direction along which the embeddings move monotonically as the audio effect strength increases, but that the subspace containing the displacements is generally high-dimensional. This shows that pre-trained audio embeddings do not globally linearize the effects. Our empirical results on instrument classification downstream tasks confirm that projecting out the estimated deformation directions does not, in general, improve the robustness of pre-trained embeddings to audio effects.
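To make the trajectory analysis concrete, here is a minimal sketch (not the authors' code): one clip is processed at a grid of gain strengths, embedded, and the displacements are tested with a one-component CCA for a direction that moves monotonically with effect strength. The `embed` function is a toy random-projection stand-in for a real extractor such as OpenL3, PANNs, or CLAP.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def deformation_trajectory(audio, strengths, embed):
    """Embed `audio` after applying gain (in dB) at each effect strength."""
    clips = [audio * 10.0 ** (s / 20.0) for s in strengths]  # parameterized gain
    return np.stack([embed(c) for c in clips])               # (n_strengths, dim)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 2048))
embed = lambda x: np.tanh(W @ x / 50.0)       # toy stand-in for a pretrained model

strengths = np.linspace(-12.0, 12.0, 41)      # gain strengths in dB; 0 dB = unmodified
E = deformation_trajectory(rng.standard_normal(2048), strengths, embed)
D = E - E[len(E) // 2]                        # displacements from the unmodified clip

cca = CCA(n_components=1)                     # one canonical direction vs. strength
u, v = cca.fit_transform(D, strengths.reshape(-1, 1))
print("canonical correlation:", np.corrcoef(u[:, 0], v[:, 0])[0, 1])
```

A correlation near 1 indicates a direction along which the embedding moves monotonically with effect strength; the paper's point is that even then, the displacements span a high-dimensional subspace.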
Hybrid Losses for Hierarchical Embedding Learning
Tian, Haokun, Lattner, Stefan, McFee, Brian, Saitis, Charalampos
In traditional supervised learning, the cross-entropy loss treats all incorrect predictions equally, ignoring the relevance or proximity of wrong labels to the correct answer. By leveraging a tree hierarchy for fine-grained labels, we investigate hybrid losses, such as generalised triplet and cross-entropy losses, to enforce similarity between labels within a multi-task learning framework. We propose metrics to evaluate the embedding space structure and assess the model's ability to generalise to unseen classes, that is, to infer similar classes for data belonging to unseen categories. Our experiments on OrchideaSOL, a four-level hierarchical instrument sound dataset with nearly 200 detailed categories, demonstrate that the proposed hybrid losses outperform previous works in classification, retrieval, embedding space structure, and generalisation.
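A hedged sketch of what such a hybrid loss can look like in PyTorch: standard cross-entropy plus a triplet-style term whose margin grows with the tree distance between the anchor's label and the negative's label. The function name, the margin scaling, and the per-batch hard-positive mining below are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(emb, logits, labels, tree_dist, alpha=0.5, base_margin=0.1):
    """Cross-entropy on leaf labels plus a hierarchy-aware triplet term.

    emb:       (B, D) embeddings
    logits:    (B, C) classifier outputs
    labels:    (B,)   leaf-label indices
    tree_dist: (C, C) hop distance between labels in the hierarchy
    """
    ce = F.cross_entropy(logits, labels)

    d = torch.cdist(emb, emb)                       # pairwise embedding distances
    same = labels[:, None] == labels[None, :]       # positive-pair mask
    pos = (d * same).max(dim=1).values              # hardest positive per anchor
    margin = base_margin * tree_dist[labels][:, labels]  # bigger margin for distant labels
    viol = F.relu(pos[:, None] + margin - d) * (~same)   # hinge over negatives
    triplet = viol.sum() / (~same).sum().clamp(min=1)

    return ce + alpha * triplet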
Sound Scene Synthesis at the DCASE 2024 Challenge
Lagrange, Mathieu, Lee, Junwon, Tailleur, Modan, Heller, Laurie M., Choi, Keunwoo, McFee, Brian, Imoto, Keisuke, Okamoto, Yuki
This paper presents Task 7 at the DCASE 2024 Challenge: sound scene synthesis. Recent advances in sound synthesis and generative models have enabled the creation of realistic and diverse audio content. We introduce a standardized evaluation framework for comparing different sound scene synthesis systems, incorporating both objective and subjective metrics. The challenge attracted four submissions, which are evaluated using the Fréchet Audio Distance (FAD) and human perceptual ratings. Our analysis reveals significant insights into the current capabilities and limitations of sound scene synthesis systems, while also highlighting areas for future improvement in this rapidly evolving field.
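For reference, FAD fits a Gaussian to embeddings of reference audio and of generated audio and measures the Fréchet distance between the two. A minimal NumPy/SciPy rendering of that standard computation (the challenge's exact tooling is not reproduced here):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref, emb_gen):
    """FAD between Gaussians fit to two sets of clip embeddings:
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    S_r = np.cov(emb_ref, rowvar=False)
    S_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(S_r @ S_g)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(S_r + S_g - 2.0 * covmean))
```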
Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation
Lee, Junwon, Tailleur, Modan, Heller, Laurie M., Choi, Keunwoo, Lagrange, Mathieu, McFee, Brian, Imoto, Keisuke, Okamoto, Yuki
Despite significant advancements in neural text-to-audio generation, challenges persist in controllability and evaluation. This paper addresses these issues through the Sound Scene Synthesis challenge held as part of the Detection and Classification of Acoustic Scenes and Events 2024. We present an evaluation protocol combining an objective metric, the Fréchet Audio Distance, with perceptual assessments, utilizing a structured prompt format to enable diverse captions and effective evaluation. Our analysis reveals varying performance across sound categories and model architectures, with larger models generally excelling but innovative lightweight approaches also showing promise. The strong correlation between objective metrics and human ratings validates our evaluation approach. We discuss outcomes in terms of audio quality, controllability, and architectural considerations for text-to-audio synthesizers, providing direction for future research.
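The reported metric-rating agreement can be checked with a rank correlation; the sketch below uses Spearman's rho on made-up placeholder scores (the actual challenge data is not reproduced here). FAD is negated so that a positive rho indicates agreement between the metric and the ratings.

```python
import numpy as np
from scipy.stats import spearmanr

# hypothetical per-system scores: lower FAD should track higher perceptual ratings
fad = np.array([2.1, 3.4, 1.7, 4.0])        # objective metric per submission
ratings = np.array([7.2, 5.9, 7.8, 4.6])    # mean human rating per submission

rho, p = spearmanr(-fad, ratings)           # negate FAD: lower is better
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```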
Spatial Scaper: A Library to Simulate and Augment Soundscapes for Sound Event Localization and Detection in Realistic Rooms
Roman, Iran R., Ick, Christopher, Ding, Sivan, Roman, Adrian S., McFee, Brian, Bello, Juan P.
Sound event localization and detection (SELD) is an important task in machine listening. Major advancements rely on simulated data with sound events in specific rooms and strong spatio-temporal labels. SELD data is simulated by convolving spatially localized room impulse responses (RIRs) with sound waveforms to place sound events in a soundscape. However, RIRs require manual collection in specific rooms. We present SpatialScaper, a library for SELD data simulation and augmentation. Compared to existing tools, SpatialScaper emulates virtual rooms via parameters such as size and wall absorption. This allows for parameterized placement (including movement) of foreground and background sound sources. SpatialScaper also includes data augmentation pipelines that can be applied to existing SELD data. As a case study, we use SpatialScaper to add rooms to the DCASE SELD data. Training a model with our data led to progressive performance improvements as a direct function of acoustic diversity. These results show that SpatialScaper is valuable for training robust SELD models.
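The core simulation step such tooling automates, convolving a mono event with a multichannel RIR and mixing it into a soundscape at a chosen onset, can be sketched as follows. This is a generic illustration, not SpatialScaper's API; the function name and signature are hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve

def place_event(soundscape, event, rir, onset, sr):
    """Mix a mono `event` into a multichannel `soundscape` at `onset` seconds
    by convolving it with a spatial room impulse response.

    soundscape: (n_channels, n_samples) background mix, modified in place
    event:      (n_event,) mono sound event waveform
    rir:        (n_channels, n_rir) RIR at one source position in the room
    """
    start = int(onset * sr)
    for ch in range(soundscape.shape[0]):
        wet = fftconvolve(event, rir[ch])              # spatialize the event
        end = min(start + wet.size, soundscape.shape[1])
        soundscape[ch, start:end] += wet[: end - start]
    return soundscape
```

Moving sources are typically handled by interpolating between RIRs at nearby positions; the single-position convolution above is the building block.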
A Proposal for Foley Sound Synthesis Challenge
Choi, Keunwoo, Oh, Sangshin, Kang, Minsung, McFee, Brian
Foley refers to sound effects added to multimedia during post-production to enhance its perceived acoustic properties, e.g., by simulating the sounds of footsteps, ambient environmental sounds, or visible objects on the screen. While foley is traditionally produced by foley artists, there is increasing interest in automatic or machine-assisted techniques building upon recent advances in sound synthesis and generative models. To foster more participation in this growing research area, we propose a challenge for automatic foley synthesis. Through case studies on successful previous challenges in audio and machine learning, we set the goals of the proposed challenge: rigorous, unified, and efficient evaluation of different foley synthesis systems, with an overarching goal of drawing active participation from the research community. We outline the details and design considerations of a foley sound synthesis challenge, including task definition, dataset requirements, and evaluation criteria.
Adaptive pooling operators for weakly labeled sound event detection
McFee, Brian, Salamon, Justin, Bello, Juan Pablo
Sound event detection (SED) methods are tasked with labeling segments of audio recordings by the presence of active sound sources. SED is typically posed as a supervised machine learning problem, requiring strong annotations for the presence or absence of each sound source at every time instant within the recording. However, strong annotations of this type are both labor- and cost-intensive for human annotators to produce, which limits the practical scalability of SED methods. In this work, we treat SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality. The models, however, must still produce temporally dynamic predictions, which must be aggregated (pooled) when comparing against static labels during training. To facilitate this aggregation, we develop a family of adaptive pooling operators---referred to as auto-pool---which smoothly interpolate between common pooling operators, such as min-, max-, or average-pooling, and automatically adapt to the characteristics of the sound sources in question. We evaluate the proposed pooling operators on three datasets, and demonstrate that in each case, the proposed methods outperform non-adaptive pooling operators for static prediction, and nearly match the performance of models trained with strong, dynamic annotations. The proposed method is evaluated in conjunction with convolutional neural networks, but can be readily applied to any differentiable model for time-series label prediction.
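The auto-pool operator reduces to a softmax-weighted temporal average with a learnable per-class scale alpha: alpha = 0 recovers mean-pooling, and large positive (negative) alpha approaches max- (min-) pooling. A compact PyTorch rendering (shapes and naming are illustrative):

```python
import torch
import torch.nn as nn

class AutoPool(nn.Module):
    """Auto-pool: softmax-weighted pooling over time with a learnable scale.

    alpha = 0 -> unweighted mean; alpha -> +inf -> max; alpha -> -inf -> min.
    """
    def __init__(self, n_classes):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(n_classes))  # one scale per class

    def forward(self, x):
        # x: (batch, time, n_classes) frame-level activations
        w = torch.softmax(self.alpha * x, dim=1)           # pooling weights over time
        return (x * w).sum(dim=1)                          # (batch, n_classes)
```

In the MIL setting described above, a frame-level classifier's outputs are passed through this pooling layer to produce clip-level predictions that can be trained against the static excerpt labels.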
Towards Music Captioning: Generating Music Playlist Descriptions
Choi, Keunwoo, Fazekas, George, McFee, Brian, Cho, Kyunghyun, Sandler, Mark
Descriptions are often provided along with recommendations to aid users' discovery. Recommending automatically generated music playlists (e.g. personalised playlists) introduces the problem of generating their descriptions. In this paper, we propose a method for generating music playlist descriptions, which we call music captioning. The proposed method adopts audio content analysis and natural language processing to utilise the information of each track.
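A hedged sketch of the general encoder-decoder shape such a captioning system can take: per-track audio feature vectors are summarized by an RNN encoder, and a word sequence is decoded from the summary. This illustrates the idea only, not the paper's architecture; feature extraction and vocabulary handling are assumed upstream.

```python
import torch
import torch.nn as nn

class PlaylistCaptioner(nn.Module):
    """Encode a sequence of per-track features, then decode a caption."""
    def __init__(self, feat_dim, vocab_size, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, track_feats, caption_tokens):
        # track_feats: (B, n_tracks, feat_dim); caption_tokens: (B, seq_len)
        _, h = self.encoder(track_feats)       # summarize the playlist's tracks
        y, _ = self.decoder(self.embed(caption_tokens), h)
        return self.out(y)                     # (B, seq_len, vocab_size) word logits
```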
Learning Multi-modal Similarity
McFee, Brian, Lanckriet, Gert
In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, e.g., nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multi-media similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure.
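A much-simplified sketch of the idea: given base kernels and relative comparisons (i, j, k) meaning "i is more similar to j than to k", learn a nonnegative combination whose induced distances respect the comparisons. The paper learns full kernel transformations with graph-based filtering of the comparisons; the projected-gradient, scalar-weight version below is only illustrative.

```python
import numpy as np

def kernel_sq_dists(K):
    """Squared distances induced by kernel K: d2[i,j] = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

def learn_kernel_weights(kernels, triplets, lr=0.01, epochs=200, margin=1.0):
    """Projected gradient descent on a hinge loss over relative comparisons."""
    D = np.stack([kernel_sq_dists(K) for K in kernels])  # (n_kernels, n, n)
    w = np.ones(len(kernels)) / len(kernels)
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for i, j, k in triplets:
            if margin + D[:, i, j] @ w - D[:, i, k] @ w > 0:  # violated comparison
                grad += D[:, i, j] - D[:, i, k]
        w = np.maximum(w - lr * grad / max(len(triplets), 1), 0.0)  # stay nonnegative
    return w
```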