It is a very well-designed library that clearly abides by its guiding principles of modularity and extensibility, enabling us to easily assemble powerful, complex models from primitive building blocks. This has been demonstrated in numerous blog posts and tutorials, in particular, the excellent tutorial on Building Autoencoders in Keras. As the name suggests, that tutorial provides examples of how to implement various kinds of autoencoders in Keras, including the variational autoencoder (VAE) [1]. Visualization of 2D manifold of MNIST digits (left) and the representation of digits in latent space colored according to their digit labels (right). Like all autoencoders, the variational autoencoder is primarily used for unsupervised learning of hidden representations.

The goal of the variational autoencoder (VAE) is to learn a probability distribution $Pr(\mathbf{x})$ over a multi-dimensional variable $\mathbf{x}$. There are two main reasons for modelling distributions. First, we might want to draw samples (generate) from the distribution to create new plausible values of $\mathbf{x}$. Second, we might want to measure the likelihood that a new vector $\mathbf{x} {*}$ was created by this probability distribution. In fact, it turns out that the variational autoencoder is well-suited to the former task but not for the latter. It is common to talk about the variational autoencoder as if it is the model of $Pr(\mathbf{x})$. However, this is misleading; the variational autoencoder is a neural architecture that is designed to help learn the model for $Pr(\mathbf{x})$.

Figure 1: A two-dimensional convex function represented via contour lines. The function value is constant on the boundary of each such ellipse, and decreases as the ellipse becomes smaller and smaller. Let us assume we want to minimize this function starting from a point \(A\). The red line shows the path followed by a gradient descent optimizer converging to the minimum point \(B\), while the green dashed line represents the direct line joining \(A\) and \(B\). In today's post, we will discuss an interesting property concerning the trajectory of gradient descent iterates, namely the length of the Gradient Descent curve.

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of basis vectors efficiently, we wish to learn an over-complete set of basis vectors to represent input vectors \mathbf{x}\in\mathbb{R} n (i.e. The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data. Therefore, in sparse coding, we introduce the additional criterion of sparsity to resolve the degeneracy introduced by over-completeness. Here, we define sparsity as having few non-zero components or having few components not close to zero.

This repository contains both MATLAB and R code for implementing the Bayesian GP-LVM. The MATLAB code is in the subdirectory vargplvm, the R code in vargplvmR. For a quick description and sample videos / demos check: http://git.io/A3Uv The Bayesian GP-LVM (Titsias and Lawrence, 2010) is an extension of the traditional GP-LVM where the latent space is approximately marginalised out in a variational fashion (hence the prefix'vargplvm'). Let us denote \mathbf{Y} as a matrix of observations (here called outputs) with dimensions n \times p, where n rows correspond to datapoints and p columns to dimensions.