Plotting

 Schlüter, Jan


Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers

arXiv.org Artificial Intelligence

The success of supervised deep learning methods is largely due to their ability to learn relevant features from raw data. Deep Neural Networks (DNNs) trained on large-scale datasets are capable of capturing a diverse set of features and learning a representation that can generalize to unseen tasks and datasets from the same domain. Hence, these models can be used as powerful feature extractors, in combination with shallower models as classifiers, for smaller tasks and datasets where the amount of training data is insufficient for learning an end-to-end model from scratch. In recent years, Convolutional Neural Networks (CNNs) have largely been the method of choice for audio processing. However, attention-based transformer models have recently demonstrated great potential in supervised settings, outperforming CNNs. In this work, we investigate the use of audio transformers trained on large-scale datasets to learn general-purpose representations. We study how different setups in these audio transformers affect the quality of their embeddings. We experiment with the models' time resolution, extracted embedding level, and receptive fields, and measure how these choices affect performance on a variety of tasks and datasets, following the HEAR 2021 NeurIPS challenge evaluation setup. Our results show that representations extracted by audio transformers outperform CNN representations. Furthermore, we show that transformers trained on Audioset can be extremely effective representation extractors for a wide range of downstream tasks.
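The transfer setup this abstract describes (frozen large-model embeddings fed to a shallow classifier) can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the random projection below is a hypothetical stand-in for a pretrained transformer such as PaSST, and the nearest-centroid model stands in for the shallow downstream classifier.

```python
import numpy as np

def extract_embedding(spectrogram):
    """Hypothetical stand-in for a frozen pretrained extractor:
    a fixed random projection of the time-averaged spectrogram."""
    rng = np.random.default_rng(0)           # fixed seed = "frozen" weights
    w = rng.standard_normal((spectrogram.shape[0], 128))
    pooled = spectrogram.mean(axis=1)        # average over the time axis
    return pooled @ w                        # 128-dim embedding

class NearestCentroid:
    """Shallow classifier trained on top of the frozen embeddings."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None] - self.centroids_[None], axis=-1)
        return self.classes_[d.argmin(axis=1)]
```

In practice only the small classifier is trained per downstream task, which is what makes this viable when labeled data is scarce.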


Efficient Training of Audio Transformers with Patchout

arXiv.org Artificial Intelligence

The great success of transformer-based models in natural language processing (NLP) has led to various attempts at adapting these architectures to other domains such as vision and audio. Recent work has shown that transformers can outperform Convolutional Neural Networks (CNNs) on vision and audio tasks. However, one of the main shortcomings of transformer models, compared to the well-established CNNs, is their computational complexity: in transformers, compute and memory requirements are known to grow quadratically with the input length. There has therefore been extensive work on optimizing transformers, but often at the cost of degraded predictive performance. In this work, we propose a novel method to optimize and regularize transformers on audio spectrograms. Our proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU. Furthermore, we propose a transformer model that surpasses CNNs in both predictive performance and training speed. Source code: https://github.com/kkoutini/PaSST
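The core idea of patchout is to drop parts of the patch-embedded spectrogram during training, which both regularizes the model and shortens the quadratic-cost input sequence. Below is a minimal numpy sketch of structured patchout (dropping whole time columns and frequency rows of the patch grid); the function name, arguments, and grid layout are illustrative assumptions, not the PaSST implementation.

```python
import numpy as np

def patchout(patches, n_time, n_freq, t_drop=2, f_drop=1, seed=None):
    """Structured patchout: remove whole time columns and frequency rows
    from the flattened patch grid, shrinking the transformer's input.

    patches: array of shape (n_time * n_freq, d), row-major patch grid
    returns: array of shape ((n_time - t_drop) * (n_freq - f_drop), d)
    """
    rng = np.random.default_rng(seed)
    grid = patches.reshape(n_time, n_freq, -1)
    keep_t = np.sort(rng.choice(n_time, n_time - t_drop, replace=False))
    keep_f = np.sort(rng.choice(n_freq, n_freq - f_drop, replace=False))
    kept = grid[np.ix_(keep_t, keep_f)]      # keep the surviving rows/cols
    return kept.reshape(-1, grid.shape[-1])
```

Because the sequence entering self-attention is shorter, the quadratic attention cost drops during training; at inference time the full sequence is used.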


Over-Parameterization and Generalization in Audio Classification

arXiv.org Machine Learning

Convolutional Neural Networks (CNNs) have dominated classification tasks in various domains, such as machine vision, machine listening, and natural language processing. In machine listening, CNNs generally exhibit very good generalization, but they are sensitive to the specific audio recording device used, which has been recognized as a substantial problem in the acoustic scene classification (DCASE) community. In this study, we investigate the relationship between over-parameterization of acoustic scene classification models and their resulting generalization abilities. Specifically, we test scaling CNNs in width and depth under different conditions. Our results indicate that increasing width improves generalization to unseen devices, even without an increase in the number of parameters.
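Scaling a CNN in width versus depth changes the parameter count very differently, which is why the two axes are studied separately. The helper below is a rough back-of-the-envelope sketch (plain convolutions only, no pooling or normalization parameters), not the paper's architecture:

```python
def conv_params(c_in, c_out, k=3):
    """Parameters of one k x k convolution layer: weights plus biases."""
    return c_in * c_out * k * k + c_out

def cnn_param_count(widths, in_ch=1, k=3):
    """Total parameters of a stack of conv layers with the given widths."""
    total, c = 0, in_ch
    for w in widths:
        total += conv_params(c, w, k)
        c = w
    return total
```

Doubling every layer's width roughly quadruples the parameter count (each inner layer's weights scale with `c_in * c_out`), whereas doubling depth only roughly doubles it; this asymmetry is what makes the paper's width-without-extra-parameters finding notable.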


Deep Learning for Audio Signal Processing

arXiv.org Machine Learning

Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Audio signals are commonly transformed into two-dimensional time-frequency representations for processing, but the two axes differ from those of images: images are instantaneous snapshots of a target, often analyzed as a whole or in patches with few ordering constraints, whereas audio signals have to be studied sequentially in chronological order. In this "deep" paradigm, architectures with a large number of parameters are trained to learn from massive amounts of data; the success of deep learning in speech recognition [3] and image classification [4] in 2012 led to a renaissance of deep learning, involving, e.g., convolutional neural networks (CNNs, [6]) and long short-term memory (LSTM, [7]). In this most recent wave, deep learning first gained traction in image processing [4], but was then widely adopted in speech processing, music, and many other areas of signal processing, often outperforming traditional methods. To set the stage, the article gives a conceptual overview of audio analysis and synthesis problems (II-A), the input representations commonly used to address them (II-B), and the models shared between different application fields (II-C). Subsequently, prominent deep learning application areas are covered. While the audio signal will often be processed into a sequence of features, the authors consider this part of the solution, not of the task.

H. Purwins is with the Department of Architecture, Design & Media Technology. Manuscript received October 11, 2018. Preprint of an article in the IEEE Journal of Selected Topics in Signal Processing. Personal use of this material is permitted; permission from IEEE must be obtained for all other uses.
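The time-frequency representation the review centers on can be computed with a short-time Fourier transform. The following is a minimal numpy sketch of a log-magnitude spectrogram (no mel filterbank, no padding or centering, which real libraries such as librosa handle); the function name and defaults are illustrative.

```python
import numpy as np

def stft_logmag(x, n_fft=512, hop=256):
    """Log-magnitude spectrogram of a 1-D signal.

    Frames the signal with a Hann window, applies the real FFT per
    frame, and compresses magnitudes with log1p.
    Returns an array of shape (n_fft // 2 + 1, n_frames).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(spec).T                  # (freq_bins, time_frames)
```

The transpose at the end yields the frequency-by-time layout the review discusses, where the time axis must be read in chronological order while the frequency axis behaves more like a spatial image axis.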