Morocutti, Tobias
Exploring Performance-Complexity Trade-Offs in Sound Event Detection
Morocutti, Tobias, Schmid, Florian, Greif, Jonathan, Foscarin, Francesco, Widmer, Gerhard
We target the problem of developing new low-complexity networks for the sound event detection task. Our goal is to meticulously analyze the performance-complexity trade-off, aiming to be competitive with the large state-of-the-art models, at a fraction of the computational requirements. We find that low-complexity convolutional models previously proposed for audio tagging can be effectively adapted for event detection (which requires frame-wise prediction) by adjusting convolutional strides, removing the global pooling, and, importantly, adding a sequence model before the (now frame-wise) classification heads. Systematic experiments reveal that the best choice for the sequence model type depends on which complexity metric is most important for the given application. We also investigate the impact of enhanced training strategies such as knowledge distillation. In the end, we show that combined with an optimized training strategy, we can reach event detection performance comparable to state-of-the-art transformers while requiring only around 5% of the parameters. We release all our pre-trained models and the code for reproducing this work to support future research in low-complexity sound event detection at https://github.com/theMoro/EfficientSED.
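The architectural adaptation described in the abstract can be summarized in a short sketch. The snippet below is illustrative only; the backbone interface, feature dimension, and the choice of a bidirectional GRU as the sequence model are assumptions made for the example, not the exact configuration released in the repository.

```python
import torch
import torch.nn as nn

class FrameWiseSED(nn.Module):
    """Hypothetical sketch: adapt a tagging-style CNN backbone for sound event
    detection by keeping the time axis (no global pooling) and inserting a
    sequence model before a frame-wise classification head."""

    def __init__(self, cnn_backbone: nn.Module, feat_dim: int, n_classes: int, rnn_dim: int = 256):
        super().__init__()
        # Assumed to return (batch, feat_dim, time) feature maps, i.e. frequency
        # is pooled but temporal resolution is preserved via smaller time strides.
        self.backbone = cnn_backbone
        # One possible sequence model; the paper compares several alternatives.
        self.sequence_model = nn.GRU(feat_dim, rnn_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * rnn_dim, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, mel_bins, frames) log-mel spectrogram
        feats = self.backbone(spec)            # (batch, feat_dim, time)
        feats = feats.transpose(1, 2)          # (batch, time, feat_dim)
        seq, _ = self.sequence_model(feats)    # temporal context across frames
        return torch.sigmoid(self.head(seq))   # frame-wise multi-label event probabilities
```

The sketch mirrors the three changes named above: reduced convolutional strides along time, removal of the global pooling, and a sequence model feeding the now frame-wise classification head.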
Creating a Good Teacher for Knowledge Distillation in Acoustic Scene Classification
Morocutti, Tobias, Schmid, Florian, Koutini, Khaled, Widmer, Gerhard
The DCASE23 challenge's [1] Low-Complexity Acoustic Scene Classification task focuses on utilizing the TAU Urban Acoustic Scenes 2022 Mobile development dataset (TAU22) [2]. This dataset comprises one-second audio snippets from ten distinct acoustic scenes. To make the models deployable on edge devices, a complexity limit is enforced: models are constrained to have no more than 128,000 parameters and 30 million multiply-accumulate operations (MMACs) for the inference of a 1-second audio snippet. Among other model compression techniques such as Quantization [3] and Pruning [4], Knowledge Distillation (KD) [5-7] proved to be a particularly well-suited technique for improving the performance of a low-complexity model in ASC. In a standard KD setting, a low-complexity student model learns to mimic the teacher by minimizing a weighted sum of hard label loss and distillation loss. The soft targets are usually obtained from one or multiple, possibly complex, teacher models, and the distillation loss matches the student predictions to these soft targets using the Kullback-Leibler divergence. Jung et al. [8] demonstrate that soft targets in a teacher-student setup benefit the learning process, since one-hot labels do not reflect the blurred decision boundaries between different acoustic scenes. Knowledge distillation has also been a very popular method in DCASE challenge submissions.
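As a concrete illustration of the standard KD objective described above, the following sketch combines a hard-label cross-entropy term with a temperature-softened KL-divergence distillation term. The weighting factor `alpha` and the temperature value are placeholder choices for the example, not the settings tuned in the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Generic knowledge-distillation objective: weighted sum of hard-label
    loss and a KL-divergence loss on temperature-softened predictions."""
    # Hard-label loss against the one-hot scene labels.
    hard = F.cross_entropy(student_logits, labels)
    # Distillation loss: KL divergence between softened teacher and student distributions.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft
```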
Device-Robust Acoustic Scene Classification via Impulse Response Augmentation
Morocutti, Tobias, Schmid, Florian, Koutini, Khaled, Widmer, Gerhard
The ability to generalize to a wide range of recording devices is a crucial performance factor for audio classification models. The characteristics of different types of microphones introduce distributional shifts in the digitized audio signals due to their varying frequency responses. If this domain shift is not taken into account during training, the model's performance could degrade severely when it is applied to signals recorded by unseen devices. In particular, training a model on audio signals recorded with a small number of different microphones can make generalization to unseen devices difficult. To tackle this problem, we convolve audio signals in the training set with pre-recorded device impulse responses (DIRs) to artificially increase the diversity of recording devices. We systematically study the effect of DIR augmentation on the task of Acoustic Scene Classification using CNNs and Audio Spectrogram Transformers. The results show that DIR augmentation in isolation performs similarly to the state-of-the-art method Freq-MixStyle. However, we also show that DIR augmentation and Freq-MixStyle are complementary, achieving a new state-of-the-art performance on signals recorded by devices unseen during training.
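A minimal sketch of the DIR augmentation idea is given below. The function name, the impulse-response bank, the application probability `p`, and the peak rescaling are illustrative assumptions; the core operation is the convolution of a training waveform with a randomly chosen pre-recorded device impulse response.

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_dir_augmentation(waveform: np.ndarray, dir_bank: list, p: float = 0.4) -> np.ndarray:
    """With probability p, convolve the training waveform with a randomly
    selected device impulse response (DIR) to simulate an unseen microphone."""
    if np.random.rand() > p:
        return waveform
    dir_ = dir_bank[np.random.randint(len(dir_bank))]
    augmented = fftconvolve(waveform, dir_, mode="full")[: len(waveform)]
    # Rescale so the augmented signal keeps roughly the original peak level.
    augmented *= np.max(np.abs(waveform)) / (np.max(np.abs(augmented)) + 1e-9)
    return augmented
```

In training, such a step would typically be combined with other device-robustness methods such as Freq-MixStyle, which the abstract reports to be complementary to DIR augmentation.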