Plotting

 Schlüter, Jan


Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers

arXiv.org Artificial Intelligence

The success of supervised deep learning methods is largely due to their ability to learn relevant features from raw data. Deep Neural Networks (DNNs) trained on large-scale datasets are capable of capturing a diverse set of features and learning a representation that can generalize to unseen tasks and datasets from the same domain. Hence, these models can be used as powerful feature extractors, in combination with shallower models as classifiers, for smaller tasks and datasets where the amount of training data is insufficient for learning an end-to-end model from scratch. In recent years, Convolutional Neural Networks (CNNs) have largely been the method of choice for audio processing. However, attention-based transformer models have recently demonstrated great potential in supervised settings, outperforming CNNs. In this work, we investigate the use of audio transformers trained on large-scale datasets to learn general-purpose representations. We study how different setups in these audio transformers affect the quality of their embeddings. We experiment with the models' time resolution, extracted embedding level, and receptive fields, and measure how these choices affect performance on a variety of tasks and datasets, following the HEAR 2021 NeurIPS challenge evaluation setup. Our results show that representations extracted by audio transformers outperform CNN representations. Furthermore, we show that transformers trained on Audioset can be extremely effective representation extractors for a wide range of downstream tasks.
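The transfer setup this abstract describes (frozen large-model embeddings fed to a shallow classifier) can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the random projection below is a hypothetical stand-in for a pretrained transformer such as PaSST, and the nearest-centroid model stands in for the shallow downstream classifier.

```python
import numpy as np

def extract_embedding(spectrogram):
    """Hypothetical stand-in for a frozen pretrained extractor:
    a fixed random projection of the time-averaged spectrogram."""
    rng = np.random.default_rng(0)           # fixed seed = "frozen" weights
    w = rng.standard_normal((spectrogram.shape[0], 128))
    pooled = spectrogram.mean(axis=1)        # average over the time axis
    return pooled @ w                        # 128-dim embedding

class NearestCentroid:
    """Shallow classifier trained on top of the frozen embeddings."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None] - self.centroids_[None], axis=-1)
        return self.classes_[d.argmin(axis=1)]
```

In practice only the small classifier is trained per downstream task, which is what makes this viable when labeled data is scarce.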


Efficient Training of Audio Transformers with Patchout

arXiv.org Artificial Intelligence

The great success of transformer-based models in natural language processing (NLP) has led to various attempts at adapting these architectures to other domains such as vision and audio. Recent work has shown that transformers can outperform Convolutional Neural Networks (CNNs) on vision and audio tasks. However, one of the main shortcomings of transformer models, compared to the well-established CNNs, is their computational complexity: in transformers, compute and memory requirements are known to grow quadratically with the input length. There has therefore been extensive work on optimizing transformers, but often at the cost of degraded predictive performance. In this work, we propose a novel method to optimize and regularize transformers on audio spectrograms. Our proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU. Furthermore, we propose a transformer model that surpasses CNNs in both predictive performance and training speed. Source code: https://github.com/kkoutini/PaSST
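The core idea of patchout is to drop parts of the patch-embedded spectrogram during training, which both regularizes the model and shortens the quadratic-cost input sequence. Below is a minimal numpy sketch of structured patchout (dropping whole time columns and frequency rows of the patch grid); the function name, arguments, and grid layout are illustrative assumptions, not the PaSST implementation.

```python
import numpy as np

def patchout(patches, n_time, n_freq, t_drop=2, f_drop=1, seed=None):
    """Structured patchout: remove whole time columns and frequency rows
    from the flattened patch grid, shrinking the transformer's input.

    patches: array of shape (n_time * n_freq, d), row-major patch grid
    returns: array of shape ((n_time - t_drop) * (n_freq - f_drop), d)
    """
    rng = np.random.default_rng(seed)
    grid = patches.reshape(n_time, n_freq, -1)
    keep_t = np.sort(rng.choice(n_time, n_time - t_drop, replace=False))
    keep_f = np.sort(rng.choice(n_freq, n_freq - f_drop, replace=False))
    kept = grid[np.ix_(keep_t, keep_f)]      # keep the surviving rows/cols
    return kept.reshape(-1, grid.shape[-1])
```

Because the sequence entering self-attention is shorter, the quadratic attention cost drops during training; at inference time the full sequence is used.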


Over-Parameterization and Generalization in Audio Classification

arXiv.org Machine Learning

Convolutional Neural Networks (CNNs) have dominated classification tasks in various domains, such as machine vision, machine listening, and natural language processing. In machine listening, CNNs generally exhibit very good generalization, but they are sensitive to the specific audio recording device used, which has been recognized as a substantial problem in the acoustic scene classification (DCASE) community. In this study, we investigate the relationship between over-parameterization of acoustic scene classification models and their resulting generalization abilities. Specifically, we test scaling CNNs in width and depth under different conditions. Our results indicate that increasing width improves generalization to unseen devices, even without an increase in the number of parameters.
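Scaling a CNN in width versus depth changes the parameter count very differently, which is why the two axes are studied separately. The helper below is a rough back-of-the-envelope sketch (plain convolutions only, no pooling or normalization parameters), not the paper's architecture:

```python
def conv_params(c_in, c_out, k=3):
    """Parameters of one k x k convolution layer: weights plus biases."""
    return c_in * c_out * k * k + c_out

def cnn_param_count(widths, in_ch=1, k=3):
    """Total parameters of a stack of conv layers with the given widths."""
    total, c = 0, in_ch
    for w in widths:
        total += conv_params(c, w, k)
        c = w
    return total
```

Doubling every layer's width roughly quadruples the parameter count (each inner layer's weights scale with `c_in * c_out`), whereas doubling depth only roughly doubles it; this asymmetry is what makes the paper's width-without-extra-parameters finding notable.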


Deep Learning for Audio Signal Processing

arXiv.org Machine Learning

Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Audio signals are commonly transformed into two-dimensional time-frequency representations for processing, but the two axes differ from those of images: images are instantaneous snapshots of a target, often analyzed as a whole or in patches with few ordering constraints, whereas audio signals have to be studied sequentially in chronological order. In this "deep" paradigm, architectures with a large number of parameters are trained to learn from massive amounts of data; the success of deep learning in speech recognition [3] and image classification [4] in 2012 led to a renaissance of deep learning, involving, e.g., convolutional neural networks (CNNs, [6]) and long short-term memory (LSTM, [7]). In this most recent wave, deep learning first gained traction in image processing [4], but was then widely adopted in speech processing, music, and many other areas of signal processing, often outperforming traditional methods. To set the stage, the article gives a conceptual overview of audio analysis and synthesis problems (II-A), the input representations commonly used to address them (II-B), and the models shared between different application fields (II-C). Subsequently, prominent deep learning application areas are covered. While the audio signal will often be processed into a sequence of features, the authors consider this part of the solution, not of the task.

H. Purwins is with the Department of Architecture, Design & Media Technology. Manuscript received October 11, 2018. Preprint of an article in the IEEE Journal of Selected Topics in Signal Processing. Personal use of this material is permitted; permission from IEEE must be obtained for all other uses.
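The time-frequency representation the review centers on can be computed with a short-time Fourier transform. The following is a minimal numpy sketch of a log-magnitude spectrogram (no mel filterbank, no padding or centering, which real libraries such as librosa handle); the function name and defaults are illustrative.

```python
import numpy as np

def stft_logmag(x, n_fft=512, hop=256):
    """Log-magnitude spectrogram of a 1-D signal.

    Frames the signal with a Hann window, applies the real FFT per
    frame, and compresses magnitudes with log1p.
    Returns an array of shape (n_fft // 2 + 1, n_frames).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(spec).T                  # (freq_bins, time_frames)
```

The transpose at the end yields the frequency-by-time layout the review discusses, where the time axis must be read in chronological order while the frequency axis behaves more like a spatial image axis.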