If you aren't familiar with Generative Adversarial Networks (GANs), they are a massively popular generative modeling technique formed by pitting two Deep Neural Networks, a generator and a discriminator, against each other. This adversarial setup has sparked the interest of many Deep Learning and Artificial Intelligence researchers. However, despite the beauty of the GAN formulation and the eye-opening results of the state-of-the-art architectures, GANs are generally very difficult to train. One of the best ways to get better results with GANs is to provide class labels. This is the basis of the conditional-GAN model.
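To make the idea of "providing class labels" concrete, here is a minimal sketch of how a generator's input might be conditioned on a label. It assumes one common formulation, in which a one-hot label vector is concatenated with the noise vector (the discriminator would typically receive the label as well). The function names, the noise dimension, and the number of classes are hypothetical and chosen only for illustration.

```python
import random


def one_hot(label, num_classes):
    """Return a one-hot list encoding `label` among `num_classes` classes."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def conditional_generator_input(label, num_classes, noise_dim, rng):
    """Concatenate a Gaussian noise vector with a one-hot class label.

    An unconditional generator would see only the noise; the appended
    label is what lets us ask for a sample of a specific class.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + one_hot(label, num_classes)


# Hypothetical sizes: 8 noise dimensions, 10 classes, requesting class 3.
rng = random.Random(0)
z = conditional_generator_input(label=3, num_classes=10, noise_dim=8, rng=rng)
print(len(z))  # 18 = 8 noise values + 10 label entries
```

The resulting vector would be fed to the generator network in place of plain noise; the network itself is omitted here.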
If I understand correctly, both CPC and AlexNet used the same set of training images. CPC just didn't use labels, while AlexNet did. So, what about instances where a self-supervised network can be trained on 10,000x as much data as would be economically feasible to label? In these cases, are supervised learning's days numbered? The application I'm personally most interested in is self-driving cars.
Unsupervised learning is an old and well-understood problem in machine learning; LeCun's choice to replace it as the star in his cake analogy is not something he should take lightly! If you dive into the definition of self-supervised learning, you'll begin to see that it's really just an approach to unsupervised learning. Since many of the breakthroughs in machine learning this decade have been based on supervised learning techniques, successes in unsupervised problems tend to emerge when researchers re-frame an unsupervised problem as a supervised problem. Specifically, in self-supervised learning, we find a clever way to generate labels without human annotators. An easy example is a technique called next-step prediction.
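As a rough illustration of next-step prediction, the sketch below turns an unlabeled sequence into supervised (context, target) pairs, so the "label" for each example is simply the element that comes next. The function name `next_step_pairs` and the `context_size` parameter are hypothetical, and the toy sentence is only an example.

```python
def next_step_pairs(sequence, context_size):
    """Turn an unlabeled sequence into (context, target) training pairs.

    Each context is a window of `context_size` consecutive elements and
    the target is the element immediately following that window.
    """
    if context_size < 1:
        raise ValueError("context_size must be at least 1")
    pairs = []
    for i in range(len(sequence) - context_size):
        context = sequence[i:i + context_size]
        target = sequence[i + context_size]
        pairs.append((context, target))
    return pairs


tokens = "the cat sat on the mat".split()
pairs = next_step_pairs(tokens, context_size=2)
for context, target in pairs:
    print(context, "->", target)
```

No human annotator was involved: the data itself supplies the targets, and any standard supervised model could then be trained on these pairs.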
We present an electrocardiogram (ECG)-based emotion recognition system using self-supervised learning. Our proposed architecture consists of two main networks, a signal transformation recognition network and an emotion recognition network. First, unlabelled data are used to train the former network to detect specific pre-determined signal transformations in the self-supervised learning step. Next, the weights of the convolutional layers of this network are transferred to the emotion recognition network, and two dense layers are trained in order to classify arousal and valence scores. We show that our self-supervised approach helps the model learn the ECG feature manifold required for emotion recognition, performing as well as or better than the fully-supervised version of the model. Our proposed method outperforms the state-of-the-art in ECG-based emotion recognition on two publicly available datasets, SWELL and AMIGOS. Further analysis highlights the advantage of our self-supervised approach in requiring significantly less data to achieve acceptable results.
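The self-supervised step described above can be sketched as a data-generation routine: each unlabelled signal is passed through a set of known transformations, and the identity of the applied transformation becomes the training label. This is only an illustrative sketch. The abstract does not list the transformations used, so the four below (identity, negation, time reversal, scaling by 1.5) are assumptions, as are all names in the code.

```python
# Hypothetical transformations; the abstract does not say which were used.
TRANSFORMS = {
    "original": lambda x: list(x),
    "negated": lambda x: [-v for v in x],
    "reversed": lambda x: list(reversed(x)),
    "scaled": lambda x: [1.5 * v for v in x],
}
TRANSFORM_NAMES = list(TRANSFORMS)


def make_pretext_examples(signals):
    """Apply every transformation to every signal.

    Returns a list of (transformed_signal, label) pairs, where `label`
    is the index of the transformation in TRANSFORM_NAMES. No human
    annotation is needed because we applied the transformation ourselves.
    """
    examples = []
    for signal in signals:
        for label, name in enumerate(TRANSFORM_NAMES):
            examples.append((TRANSFORMS[name](signal), label))
    return examples


# Two toy "signals" standing in for unlabelled ECG segments.
signals = [[0.0, 1.0, 0.5], [0.2, -0.4, 0.8]]
examples = make_pretext_examples(signals)
print(len(examples))  # 8 = 2 signals x 4 transformations
```

A network trained to predict `label` from `transformed_signal` would play the role of the signal transformation recognition network; its convolutional weights are what the abstract describes transferring to the emotion recognition network.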
Deep learning methods are successfully used in applications pertaining to ubiquitous computing, health, and well-being. Specifically, the area of human activity recognition (HAR) has been transformed primarily by convolutional and recurrent neural networks, thanks to their ability to learn semantic representations from raw input. However, extracting generalizable features requires massive amounts of well-curated data, and collecting such data is notoriously challenging, hindered by privacy issues and annotation costs. Therefore, unsupervised representation learning is of prime importance to leverage the vast amount of unlabeled data produced by smart devices. In this work, we propose a novel self-supervised technique for feature learning from sensory data that does not require access to any form of semantic labels. We learn a multi-task temporal convolutional network to recognize transformations applied on an input signal. By exploiting these transformations, we demonstrate that simple auxiliary binary classification tasks result in a strong supervisory signal for extracting useful features for the downstream task. We extensively evaluate the proposed approach on several publicly available datasets for smartphone-based HAR in unsupervised, semi-supervised, and transfer learning settings. Our method achieves performance levels superior to or comparable with fully-supervised networks, and it performs significantly better than autoencoders. Notably, for the semi-supervised case, the self-supervised features substantially boost the detection rate by attaining a kappa score between 0.7 and 0.8 with only 10 labeled examples per class. We get similarly impressive performance even if the features are transferred from a different data source. While this paper focuses on HAR as the application domain, the proposed technique is general and could be applied to a wide variety of problems in other areas.
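For readers unfamiliar with the kappa score mentioned above, the sketch below computes it on a toy example, assuming the abstract refers to Cohen's kappa (the abstract does not say which variant is meant). Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between predictions and true labels and p_e is the agreement expected by chance. The activity labels and the function name are made up for illustration.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Compute Cohen's kappa between two equally long label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of positions where the labels match.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Only one class appears in both sequences; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Toy labels, not taken from the paper.
y_true = ["walk", "walk", "sit", "sit", "run", "run"]
y_pred = ["walk", "walk", "sit", "run", "run", "run"]
kappa = cohen_kappa(y_true, y_pred)
print(round(kappa, 2))  # 0.75
```

Here five of six predictions are correct (p_o = 5/6) and chance agreement is p_e = 1/3, giving a kappa of 0.75, which falls inside the 0.7 to 0.8 range quoted in the abstract. Unlike raw accuracy, kappa discounts the agreement a classifier would get by guessing according to the label frequencies.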