The capabilities of artificial intelligence (AI) are growing rapidly, especially in the synthesis of photorealistic images. In 2014, generative adversarial networks (GANs) were introduced. A few years later, bidirectional GANs (BiGANs) were created. Then came BigGAN, which outperformed state-of-the-art GANs in image synthesis. But wait, there's more: last week researchers from Alphabet Inc.'s DeepMind debuted BigBiGAN.
Deep generative modeling has attracted great interest as a method for data generation and representation learning. Consider observed real data X from an unknown distribution p_r on 𝒳 ⊆ ℝ^d and a latent variable Z with a known prior p_z on 𝒵 ⊆ ℝ^k. In unidirectional data generation, we are interested in learning a transformation G: 𝒵 × ℰ → 𝒳 such that the distribution of the transformed variable G(Z, ε) becomes close to p_r, where ε ∈ ℰ is a source of randomness with a specified distribution p_ε and G is referred to as the generator. In many applications, bidirectional generative modeling is favored for its ability to learn representations: we additionally learn a transformation E: 𝒳 × ℰ → 𝒵, known as the encoder. The principled formulation of bidirectional generation is to match the distributions of the two data-latent pairs (X, E(X, ε)) and (G(Z, ε), Z). Classical methods, including the Variational Autoencoder (VAE) and Bidirectional Generative Adversarial Networks (BiGAN) [2,3], handle this task using one specific distance measure as the objective. In this paper, we instead consider the f-divergence, a natural and broad class of distance measures. We discuss the advantages of this general formulation for several issues of concern, including unidirectional generation, mode coverage, and cycle consistency, especially for the Kullback-Leibler (KL) divergence, which is our main choice. For optimization, both VAE and BiGAN are limited to specific divergences and to assumptions on the encoder and generator distributions, and hence do not apply in our general formulation.
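To make the f-divergence family concrete, here is a minimal numpy sketch for discrete distributions. It uses the standard definition D_f(p ∥ q) = Σ_x q(x) f(p(x)/q(x)); the function names and the toy distributions are illustrative assumptions, not code from the paper.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x)/q(x)) for discrete distributions p, q.

    Assumes p and q have the same support (all entries positive)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# Each choice of convex f with f(1) = 0 gives one member of the family:
kl = lambda t: t * np.log(t)        # f(t) = t log t  -> KL divergence
tv = lambda t: 0.5 * np.abs(t - 1)  # f(t) = |t-1|/2  -> total variation

p = [0.5, 0.5]
q = [0.9, 0.1]
print(f_divergence(p, q, kl))  # KL(p || q) = log(5/3) ~ 0.511
print(f_divergence(p, q, tv))  # TV(p, q) = 0.4
```

Choosing f(t) = t log t recovers the KL divergence singled out in the abstract, while other convex f with f(1) = 0 give the rest of the family under one formulation.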
Generative models with an encoding component, such as autoencoders, currently receive great interest. However, training of autoencoders is typically complicated by the need to train a separate encoder and decoder model that must be constrained to be inverses of each other. Here, we propose to use by-design reversible neural networks (RevNets) as a new class of generative models. We investigate the generative performance of RevNets on the CelebA dataset, showing that generative RevNets can generate coherent faces of similar quality to Variational Autoencoders. This first attempt to use RevNets as a generative model slightly underperformed recent advanced autoencoder-based generative models on CelebA, but this gap may diminish with further optimization of the training setup of generative RevNets. In addition to the experiments on CelebA, we show a proof-of-principle experiment on the MNIST dataset suggesting that adversary-free trained RevNets can discover meaningful latent dimensions without pre-specifying the number of dimensions of the latent sampling distribution. In summary, this study shows that RevNets enable generative applications with an encoding component while overcoming the need to train a separate encoder and decoder model.
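The "by-design reversible" property comes from additive coupling blocks: the input is split into two halves and each half is updated using a function of the other, so the update can always be undone exactly. A minimal numpy sketch, with small random linear maps standing in for the residual subnetworks F and G (the toy functions and shapes are assumptions for illustration):

```python
import numpy as np

def rev_forward(x1, x2, f, g):
    """Additive-coupling reversible block (RevNet-style):
    y1 = x1 + f(x2);  y2 = x2 + g(y1)."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """Exact inverse: recover the inputs without storing activations."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

rng = np.random.default_rng(0)
# Toy "residual functions" f and g (stand-ins for small neural nets).
W_f = rng.normal(size=(4, 4))
W_g = rng.normal(size=(4, 4))
f = lambda h: np.tanh(h @ W_f)
g = lambda h: np.tanh(h @ W_g)

x1, x2 = rng.normal(size=4), rng.normal(size=4)
y1, y2 = rev_forward(x1, x2, f, g)
r1, r2 = rev_inverse(y1, y2, f, g)
print(np.allclose(x1, r1) and np.allclose(x2, r2))  # True
```

Because the inverse is exact regardless of what f and g compute, a trained RevNet doubles as an encoder (forward) and decoder (inverse) without the reciprocity constraint the abstract describes.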
It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations in most commonly used network architectures. In this paper we show, via a one-to-one mapping, that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems such as ImageNet. Via a cascade of homeomorphic layers, we build the i-RevNet, a network that can be fully inverted up to the final projection onto the classes, i.e. no information is discarded. Building an invertible architecture is difficult: for one, the local inversion is ill-conditioned; we overcome this by providing an explicit inverse. An analysis of the i-RevNet's learned representations suggests an alternative explanation for the success of deep networks: a progressive contraction and linear separation with depth. To shed light on the nature of the model learned by the i-RevNet, we reconstruct linear interpolations between natural image representations.
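One ingredient such architectures need is a way to reduce spatial resolution without discarding information. A standard trick (used in i-RevNet-style designs) is invertible downsampling: fold each b × b spatial block into the channel dimension, which is a pure reshuffling with an explicit inverse. A minimal numpy sketch, with function names chosen here for illustration:

```python
import numpy as np

def space_to_depth(x, b=2):
    """Invertible downsampling: fold each b x b spatial block into channels.
    x has shape (H, W, C); result has shape (H//b, W//b, C*b*b)."""
    H, W, C = x.shape
    x = x.reshape(H // b, b, W // b, b, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H // b, W // b, C * b * b)

def depth_to_space(y, b=2):
    """Exact inverse of space_to_depth: unfold channels back into space."""
    h, w, cbb = y.shape
    C = cbb // (b * b)
    y = y.reshape(h, w, b, b, C)
    return y.transpose(0, 2, 1, 3, 4).reshape(h * b, w * b, C)

x = np.arange(4 * 4 * 3).reshape(4, 4, 3).astype(float)
y = space_to_depth(x)
print(y.shape)                          # (2, 2, 12)
print(np.allclose(depth_to_space(y), x))  # True
```

Unlike pooling or strided convolution, this operation is a bijection, so composing it with reversible blocks keeps the whole network invertible up to the final classifier projection.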
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.
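The core of such contrastive frameworks is a loss that pulls two augmented views of the same image together while pushing all other embeddings in the batch apart; SimCLR calls this NT-Xent (normalized temperature-scaled cross-entropy). The following is a simplified numpy sketch of that idea, not the paper's reference implementation; the pairing convention (views 2k and 2k+1 are positives) and the temperature value are assumptions for illustration:

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent-style loss over 2N embeddings, where z[2k] and z[2k+1]
    are the two augmented views of example k."""
    # L2-normalize so similarities are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)       # exclude self-similarity
    n = len(z)
    pos = np.arange(n) ^ 1               # positive partner: 0<->1, 2<->3, ...
    # Cross-entropy of each row against its positive partner.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(n), pos].mean())

rng = np.random.default_rng(0)
views = rng.normal(size=(8, 16))         # 4 toy examples x 2 views each
print(nt_xent(views))
```

When the two views of each example map to nearby embeddings, the loss is low; for random embeddings it sits near log(2N − 1), which is why larger batches (more negatives) make the task more informative.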