Gfeller, Beat
MicAugment: One-shot Microphone Style Transfer
Borsos, Zalán, Li, Yunpeng, Gfeller, Beat, Tagliasacchi, Marco
A crucial aspect of successfully deploying audio-based models "in the wild" is robustness to the transformations introduced by heterogeneous acquisition conditions. In this work, we propose a method for one-shot microphone style transfer. Given only a few seconds of audio recorded by a target device, MicAugment identifies the transformations associated with the input acquisition pipeline and uses them to synthesize audio as if it had been recorded under the same conditions as the target audio. We show that our method successfully applies the style transfer to real audio and that it significantly increases model robustness when used as data augmentation in downstream tasks.
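To make the idea concrete, here is a minimal PyTorch sketch of one plausible reading: model the target device's acquisition pipeline as a short learnable FIR filter plus additive noise, fit by gradient descent so that the transformed source audio matches the target clip's average log-spectrum. The actual MicAugment pipeline model and training objective are richer than this; all function names, the spectral-matching loss, and the hyperparameters below are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def avg_log_spectrum(x, n_fft=512, hop=128):
    """Average log-magnitude STFT over time (shape: n_fft // 2 + 1)."""
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True).abs()
    return torch.log(spec + 1e-6).mean(dim=-1)

def fit_mic_style(source, target, taps=64, steps=500, lr=1e-2):
    """Learn FIR taps and a noise gain matching the target's spectral stats.

    Hypothetical stand-in for identifying a device's acquisition pipeline
    from a few seconds of target audio.
    """
    fir = torch.zeros(taps, requires_grad=True)
    with torch.no_grad():
        fir[0] = 1.0  # initialize as the identity filter
    log_noise_gain = torch.tensor(-6.0, requires_grad=True)
    opt = torch.optim.Adam([fir, log_noise_gain], lr=lr)
    target_stats = avg_log_spectrum(target)
    for _ in range(steps):
        opt.zero_grad()
        # Apply the candidate pipeline: filtering plus additive noise.
        y = F.conv1d(source.view(1, 1, -1), fir.view(1, 1, -1),
                     padding=taps - 1)[0, 0, :source.numel()]
        y = y + torch.exp(log_noise_gain) * torch.randn_like(y)
        loss = F.mse_loss(avg_log_spectrum(y), target_stats)
        loss.backward()
        opt.step()
    return fir.detach(), log_noise_gain.exp().detach()

# Stand-in data: a clean 1 s clip and a few seconds from the target device.
source = torch.randn(16000)
target = torch.randn(16000)
fir, noise_gain = fit_mic_style(source, target, steps=100)
# The learned (fir, noise_gain) can then be applied to any clean clip as a
# data-augmentation transform mimicking the target device.
```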
Sense and Learn: Self-Supervision for Omnipresent Sensors
Saeed, Aaqib, Ungureanu, Victor, Gfeller, Beat
Learning general-purpose representations from the multisensor data produced by omnipresent sensing systems (or IoT devices in general) has numerous applications across diverse domains. Existing purely supervised end-to-end deep learning techniques depend on massive amounts of well-curated data, which are notoriously difficult to acquire yet required to achieve sufficient generalization on a task of interest. In this work, we leverage the self-supervised learning paradigm towards realizing the vision of continual learning from unlabeled inputs. We present a generalized framework named Sense and Learn for representation (or feature) learning from raw sensory data. It consists of eight auxiliary tasks that learn high-level, broadly useful features entirely from unannotated data, without any human involvement in the tedious labeling process. We demonstrate the efficacy of our approach on several publicly available datasets from different domains and in various settings, including linear separability, semi-supervised and few-shot learning, and transfer learning. Our methodology achieves results competitive with supervised approaches and, in most cases, closes the gap by fine-tuning the network while learning the downstream tasks. In particular, we show that the self-supervised network can be used as an initialization to significantly boost performance in the low-data regime with as few as 5 labeled instances per class, which is of high practical importance for real-world problems. Likewise, the representations learned with self-supervision are found to be highly transferable between related datasets, even when only a few labeled instances are available from the target domains. The self-learning nature of our methodology opens up exciting possibilities for on-device continual learning.
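As an illustration of what one such auxiliary task can look like, here is a minimal PyTorch sketch of a transformation-recognition pretext task on raw sensor windows. The paper defines eight specific tasks; this single task, the tiny encoder, and all names below are hypothetical stand-ins chosen to show the general pattern: labels are generated for free from the data itself, and the encoder learned this way is later reused for downstream tasks.

```python
import torch
import torch.nn as nn

class SensorEncoder(nn.Module):
    """Small 1D-conv encoder for raw multichannel sensor windows."""
    def __init__(self, in_channels=3, embed_dim=64, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=8, stride=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=8, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.embed = nn.Linear(64, embed_dim)        # reusable representation
        self.head = nn.Linear(embed_dim, n_classes)  # pretext-task head

    def forward(self, x):
        z = self.embed(self.features(x).squeeze(-1))
        return z, self.head(z)

def make_pretext_batch(windows):
    """Apply a random transformation per window; its index is the free label."""
    labels = torch.randint(0, 3, (windows.size(0),))
    out = windows.clone()
    out[labels == 1] = out[labels == 1].flip(-1)                       # reverse
    out[labels == 2] = out[labels == 2][:, torch.randperm(windows.size(1))]
    return out, labels

# One self-supervised training step on unlabeled accelerometer-like data.
model = SensorEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
windows = torch.randn(16, 3, 128)  # batch of raw 3-axis sensor windows
x, y = make_pretext_batch(windows)
opt.zero_grad()
_, logits = model(x)
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
opt.step()
# After pretraining, the `embed` outputs serve as features for downstream
# classifiers (linear probes, few-shot fine-tuning, transfer).
```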
Self-supervised audio representation learning for mobile devices
Tagliasacchi, Marco, Gfeller, Beat, Quitry, Félix de Chaumont, Roblek, Dominik
We explore self-supervised models that could be deployed on mobile devices to learn general-purpose audio representations. Specifically, we propose methods that exploit the temporal context in the spectrogram domain. One method estimates the temporal gap between two short audio segments extracted at random from the same audio clip. The other methods are inspired by Word2Vec, a popular technique for learning word embeddings, and aim at reconstructing a temporal spectrogram slice from past and future slices or, alternatively, at reconstructing the surrounding context slices from the current slice. We focus our evaluation on small encoder architectures, which can potentially run on mobile devices both during inference (re-using a common learned representation across multiple downstream tasks) and during training (capturing the true data distribution without compromising users' privacy when combined with federated learning). We evaluate the quality of the embeddings produced by the self-supervised learning models and show that they can be re-used for a variety of downstream tasks, for some tasks even approaching the performance of fully supervised models of similar size.
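To make the first task concrete, here is a simplified PyTorch sketch of temporal-gap estimation: two segments are sampled from the same clip, encoded with a shared small encoder, and the gap between them is regressed from the pair of embeddings. The paper operates on spectrogram slices with specific architectures; the raw-waveform encoder, the regression formulation, and all hyperparameters below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SmallEncoder(nn.Module):
    """Tiny shared encoder mapping an audio segment to an embedding."""
    def __init__(self, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=16, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, x):
        return self.net(x)

def sample_pair(clip, seg_len, max_gap):
    """Pick two segments from one clip; the label is their normalized gap."""
    gap = torch.randint(0, max_gap, (1,)).item()
    start = torch.randint(0, clip.numel() - 2 * seg_len - max_gap, (1,)).item()
    a = clip[start : start + seg_len]
    b = clip[start + seg_len + gap : start + 2 * seg_len + gap]
    return a, b, gap / max_gap

encoder = SmallEncoder()
gap_head = nn.Linear(64, 1)  # takes the two concatenated embeddings
opt = torch.optim.Adam(list(encoder.parameters()) + list(gap_head.parameters()))

# One training step: no labels needed beyond the sampled gap itself.
clip = torch.randn(16000)  # stand-in 1 s clip at 16 kHz
a, b, gap = sample_pair(clip, seg_len=2000, max_gap=8000)
za = encoder(a.view(1, 1, -1))
zb = encoder(b.view(1, 1, -1))
pred = gap_head(torch.cat([za, zb], dim=-1))
opt.zero_grad()
loss = nn.functional.mse_loss(pred.squeeze(), torch.tensor(gap))
loss.backward()
opt.step()
```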