AITopics | Lerch, Alexander

Collaborating Authors

Lerch, Alexander

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Separate This, and All of these Things Around It: Music Source Separation via Hyperellipsoidal Queries

Watcharasupat, Karn N., Lerch, Alexander

arXiv.org Artificial IntelligenceJan-27-2025

Music source separation is an audio-to-audio retrieval task of extracting one or more constituent components, or composites thereof, from a musical audio mixture. Each of these constituent components is often referred to as a "stem" in literature. Historically, music source separation has been dominated by a stem-based paradigm, leading to most state-of-the-art systems being either a collection of single-stem extraction models, or a tightly coupled system with a fixed, difficult-to-modify, set of supported stems. Combined with the limited data availability, advances in music source separation have thus been mostly limited to the "VDBO" set of stems: \textit{vocals}, \textit{drum}, \textit{bass}, and the catch-all \textit{others}. Recent work in music source separation has begun to challenge the fixed-stem paradigm, moving towards models able to extract any musical sound as long as this target type of sound could be specified to the model as an additional query input. We generalize this idea to a \textit{query-by-region} source separation system, specifying the target based on the query regardless of how many sound sources or which sound classes are contained within it. To do so, we propose the use of hyperellipsoidal regions as queries to allow for an intuitive yet easily parametrizable approach to specifying both the target (location) as well as its spread. Evaluation of the proposed system on the MoisesDB dataset demonstrated state-of-the-art performance of the proposed system both in terms of signal-to-noise ratios and retrieval metrics.

artificial intelligence, machine learning, separation, (14 more...)

arXiv.org Artificial Intelligence

2501.16171

Country:

Europe (1.00)
Asia (1.00)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.40)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Uncertainty Estimation in the Real World: A Study on Music Emotion Recognition

Watcharasupat, Karn N., Ding, Yiwei, Ma, T. Aleksandra, Seshadri, Pavan, Lerch, Alexander

arXiv.org Artificial IntelligenceJan-20-2025

Any data annotation for subjective tasks shows potential variations between individuals. This is particularly true for annotations of emotional responses to musical stimuli. While older approaches to music emotion recognition systems frequently addressed this uncertainty problem through probabilistic modeling, modern systems based on neural networks tend to ignore the variability and focus only on predicting central tendencies of human subjective responses. In this work, we explore several methods for estimating not only the central tendencies of the subjective responses to a musical stimulus, but also for estimating the uncertainty associated with these responses. In particular, we investigate probabilistic loss functions and inference-time random sampling. Experimental results indicate that while the modeling of the central tendencies is achievable, modeling of the uncertainty in subjective responses proves significantly more challenging with currently available approaches even when empirical estimates of variations in the responses are available.

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2501.1157

Country:

Asia (1.00)
Europe > Germany > Bavaria (0.14)
Oceania > Australia > Queensland (0.14)
North America > United States > Hawaii (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Cognitive Science (0.86)

Add feedback

Parameter-Efficient Transfer Learning for Music Foundation Models

Ding, Yiwei, Lerch, Alexander

arXiv.org Artificial IntelligenceNov-28-2024

More music foundation models are recently being released, promising a general, mostly task independent encoding of musical information. Common ways of adapting music foundation models to downstream tasks are probing and fine-tuning. These common transfer learning approaches, however, face challenges. Probing might lead to suboptimal performance because the pre-trained weights are frozen, while fine-tuning is computationally expensive and is prone to overfitting. Our work investigates the use of parameter-efficient transfer learning (PETL) for music foundation models which integrates the advantage of probing and fine-tuning. We introduce three types of PETL methods: adapter-based methods, prompt-based methods, and reparameterization-based methods. These methods train only a small number of parameters, and therefore do not require significant computational resources. Results show that PETL methods outperform both probing and fine-tuning on music auto-tagging. On key detection and tempo estimation, they achieve similar results as fine-tuning with significantly less training cost. However, the usefulness of the current generation of foundation model on key and tempo tasks is questioned by the similar results achieved by training a small model from scratch. Code available at https://github.com/suncerock/peft-music/

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2411.19371

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.84)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback

Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation

Kim, Yonghyun, Lerch, Alexander

arXiv.org Artificial IntelligenceOct-17-2024

For instance, when employing noise injection, several key factors must be considered, Recent advancements in Automatic Piano Transcription such as the type of noise (e.g., white, pink, environmental), (APT) have significantly improved system performance, the Signal-to-Noise-Ratio (SNR), and the ratio of clean to but the impact of noisy environments on the system performance augmented data. However, to the best of our knowledge, remains largely unexplored. This study investigates these parameters are often chosen arbitrarily, highlighting the impact of white noise at various Signal-to-Noise Ratio the need for further investigation in this area.

artificial intelligence, machine learning, transcription, (11 more...)

arXiv.org Artificial Intelligence

2410.14122

Country: North America > United States (0.47)

Genre: Research Report (1.00)

Industry:

Media > Music (0.71)
Leisure & Entertainment (0.71)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems

Watcharasupat, Karn N., Lerch, Alexander

arXiv.org Artificial IntelligenceJun-26-2024

Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached the performance level of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs. Implementation is available at https://github.com/kwatcharasupat/query-bandit.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2406.18747

Country:

Asia (0.68)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.50)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Embedding Compression for Teacher-to-Student Knowledge Transfer

Ding, Yiwei, Lerch, Alexander

arXiv.org Artificial IntelligenceFeb-9-2024

Common knowledge distillation methods require the teacher model and the student model to be trained on the same task. However, the usage of embeddings as teachers has also been proposed for different source tasks and target tasks. Prior work that uses embeddings as teachers ignores the fact that the teacher embeddings are likely to contain irrelevant knowledge for the target task. To address this problem, we propose to use an embedding compression module with a trainable teacher transformation to obtain a compact teacher embedding. Results show that adding the embedding compression module improves the classification performance, especially for unsupervised teacher embeddings. Moreover, student models trained with the guidance of embeddings show stronger generalizability.

artificial intelligence, machine learning, student model, (17 more...)

arXiv.org Artificial Intelligence

2402.06761

Genre: Research Report > New Finding (0.88)

Industry: Education (0.92)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation

Watcharasupat, Karn N., Wu, Chih-Wei, Ding, Yiwei, Orife, Iroro, Hipple, Aaron J., Williams, Phillip A., Kramer, Scott, Lerch, Alexander, Wolcott, William

arXiv.org Artificial IntelligenceDec-1-2023

Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.

data mining, machine learning, proc, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/OJSP.2023.3339428

2309.02539

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Industry: Media (0.93)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Low-Resource Music Genre Classification with Cross-Modal Neural Model Reprogramming

Hung, Yun-Ning, Yang, Chao-Han Huck, Chen, Pin-Yu, Lerch, Alexander

arXiv.org Artificial IntelligenceMay-3-2023

Transfer learning (TL) approaches have shown promising results when handling tasks with limited training data. However, considerable memory and computational resources are often required for fine-tuning pre-trained neural networks with target domain data. In this work, we introduce a novel method for leveraging pre-trained models for low-resource (music) classification based on the concept of Neural Model Reprogramming (NMR). NMR aims at re-purposing a pre-trained model from a source domain to a target domain by modifying the input of a frozen pre-trained model. In addition to the known, input-independent, reprogramming method, we propose an advanced reprogramming paradigm: Input-dependent NMR, to increase adaptability to complex input data such as musical audio. Experimental results suggest that a neural model pre-trained on large-scale datasets can successfully perform music genre classification by using this reprogramming method. The two proposed Input-dependent NMR TL methods outperform fine-tuning-based TL methods on a small genre classification dataset.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2211.01317

Genre:

Research Report > Promising Solution (0.34)
Research Report > New Finding (0.34)

Industry:

Media > Music (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Music Instrument Classification Reprogrammed

Chen, Hsin-Hung, Lerch, Alexander

arXiv.org Artificial IntelligenceNov-15-2022

The performance of approaches to Music Instrument Classification, a popular task in Music Information Retrieval, is often impacted and limited by the lack of availability of annotated data for training. We propose to address this issue with "reprogramming," a technique that utilizes pre-trained deep and complex neural networks originally targeting a different task by modifying and mapping both the input and output of the pre-trained model. We demonstrate that reprogramming can effectively leverage the power of the representation learned for a different task and that the resulting reprogrammed system can perform on par or even outperform state-of-the-art systems at a fraction of training parameters. Our results, therefore, indicate that reprogramming is a promising technique potentially applicable to other tasks impeded by data scarcity.

machine learning, natural language, pre-trained model, (18 more...)

arXiv.org Artificial Intelligence

2211.08379

Genre:

Research Report > Promising Solution (0.66)
Research Report > New Finding (0.66)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Evaluating generative audio systems and their metrics

Vinay, Ashvala, Lerch, Alexander

arXiv.org Artificial IntelligenceAug-31-2022

Recent years have seen considerable advances in audio synthesis with deep generative models. However, the state-of-the-art is very difficult to quantify; different studies often use different evaluation methodologies and different metrics when reporting results, making a direct comparison to other systems difficult if not impossible. Furthermore, the perceptual relevance and meaning of the reported metrics in most cases unknown, prohibiting any conclusive insights with respect to practical usability and audio quality. This paper presents a study that investigates state-of-the-art approaches side-by-side with (i) a set of previously proposed objective metrics for audio reconstruction, and with (ii) a listening study. The results indicate that currently used objective metrics are insufficient to describe the perceptual quality of current systems.

artificial intelligence, machine learning, objective metric, (19 more...)

arXiv.org Artificial Intelligence

2209.0013

Country: North America > United States (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.94)

Industry:

Media > Music (0.46)
Leisure & Entertainment (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback