
 rbm







SM

Neural Information Processing Systems

First, let us recall that AIS is based on a simulated annealing process in which a configuration is gradually brought from temperature $T = \infty$ to $T = 1$ using a set of bridging distributions. For each temperature, we define the transition operator $T_k(v', v)$, which brings a configuration $v$ to $v'$ while varying the temperature according to the temperature schedule. In our case this is done using layer-wise MC sampling. In our work, we used a set of $N_\beta \in [10^4, 10^5]$ temperatures uniformly distributed in this interval (depending on the system size). In practice, one observes that $E_{\mathrm{RBM}}$ goes below $E_D$ at long sampling times if the machine was trained out of equilibrium.
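The annealing procedure described in this excerpt can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes binary {0,1} units, a uniform base distribution at inverse temperature $\beta = 1/T = 0$, a schedule that is linear in $\beta$, and far fewer temperatures than the $10^4$ to $10^5$ quoted above. The function names (`log_p_star`, `gibbs_step`, `ais_log_z`, `exact_log_z`) are hypothetical.

```python
import numpy as np
from itertools import product

def log_p_star(v, W, a, b, beta):
    # Unnormalized log-probability of visible configurations at inverse
    # temperature beta, with the hidden layer summed out analytically.
    return beta * (v @ a) + np.logaddexp(0.0, beta * (v @ W + b)).sum(axis=1)

def gibbs_step(v, W, a, b, beta, rng):
    # One layer-wise (block) Gibbs sweep at inverse temperature beta:
    # sample all hidden units given v, then all visible units given h.
    p_h = 1.0 / (1.0 + np.exp(-beta * (v @ W + b)))
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = 1.0 / (1.0 + np.exp(-beta * (h @ W.T + a)))
    return (rng.random(p_v.shape) < p_v).astype(float)

def ais_log_z(W, a, b, n_betas=1000, n_chains=200, seed=0):
    # Assumption: linear schedule in beta from 0 (T = infinity) to 1 (T = 1).
    rng = np.random.default_rng(seed)
    n_v, n_h = W.shape
    betas = np.linspace(0.0, 1.0, n_betas)
    # At beta = 0 the model is uniform, so fair coin flips are exact samples.
    v = (rng.random((n_chains, n_v)) < 0.5).astype(float)
    log_w = np.zeros(n_chains)
    for beta_prev, beta_next in zip(betas[:-1], betas[1:]):
        # Accumulate the importance weight, then move the chains with a
        # transition operator that leaves the next bridging distribution invariant.
        log_w += log_p_star(v, W, a, b, beta_next) - log_p_star(v, W, a, b, beta_prev)
        v = gibbs_step(v, W, a, b, beta_next, rng)
    log_z0 = (n_v + n_h) * np.log(2.0)  # exact log-partition function at beta = 0
    m = log_w.max()
    return log_z0 + m + np.log(np.mean(np.exp(log_w - m)))  # log-mean-exp

def exact_log_z(W, a, b):
    # Brute-force enumeration over visible states; feasible only for tiny models.
    vs = np.array(list(product([0.0, 1.0], repeat=W.shape[0])))
    lp = log_p_star(vs, W, a, b, 1.0)
    m = lp.max()
    return m + np.log(np.exp(lp - m).sum())
```

For a small model, comparing `ais_log_z(W, a, b)` with `exact_log_z(W, a, b)` is a quick sanity check on the schedule length and the number of chains.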


Equilibrium and non-Equilibrium regimes in the learning of Restricted Boltzmann Machines

Neural Information Processing Systems

In particular, we show that using the popular $k$ (persistent) contrastive divergence approaches, with $k$ small, the dynamics of the learned model are extremely slow and often dominated by strong out-of-equilibrium effects.



Learning Restricted Boltzmann Machines with Sparse Latent Variables

Neural Information Processing Systems

Restricted Boltzmann Machines (RBMs) are a common family of undirected graphical models with latent variables. An RBM is described by a bipartite graph, with all observed variables in one layer and all latent variables in the other. We consider the task of learning an RBM given samples generated according to it. The best algorithms for this task currently have time complexity $\tilde{O}(n^2)$ for ferromagnetic RBMs (i.e., with attractive potentials) but $\tilde{O}(n^d)$ for general RBMs, where $n$ is the number of observed variables and $d$ is the maximum degree of a latent variable. Let the \textit{MRF neighborhood} of an observed variable be its neighborhood in the Markov Random Field of the marginal distribution of the observed variables. In this paper, we give an algorithm for learning general RBMs with time complexity $\tilde{O}(n^{2^s+1})$, where $s$ is the maximum number of latent variables connected to the MRF neighborhood of an observed variable. This is an improvement when $s < \log_2 (d-1)$, which corresponds to RBMs with sparse latent variables. Furthermore, we give a version of this learning algorithm that recovers a model with small prediction error and whose sample complexity is independent of the minimum potential in the Markov Random Field of the observed variables. This is of interest because the sample complexity of current algorithms scales with the inverse of the minimum potential, which cannot be controlled in terms of natural properties of the RBM.
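The stated condition for improvement follows from comparing the exponents of the two running times. A short worked check, where the values $d = 10$ and $s = 3$ are illustrative and not taken from the paper:

$$
n^{2^s+1} < n^{d} \iff 2^s + 1 < d \iff s < \log_2(d-1).
$$

For $d = 10$ and $s = 3$, the new bound is $\tilde{O}(n^{2^3+1}) = \tilde{O}(n^{9})$ against $\tilde{O}(n^{10})$, which agrees with $3 < \log_2 9 \approx 3.17$.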


Equilibrium and non-Equilibrium regimes in the learning of Restricted Boltzmann Machines

Neural Information Processing Systems

Training Restricted Boltzmann Machines (RBMs) has been challenging for a long time due to the difficulty of computing the log-likelihood gradient precisely. Over the past decades, many works have proposed more or less successful recipes, but without systematically studying the crucial quantity of the problem: the mixing time, i.e., the number of MCMC iterations needed to sample completely new configurations from a model. In this work, we show that this mixing time plays a crucial role in the behavior and stability of the trained model, and that RBMs operate in two well-defined distinct regimes, namely equilibrium and out-of-equilibrium, depending on the interplay between the mixing time of the model and the number of MCMC steps, $k$, used to approximate the gradient. We further show empirically that this mixing time increases during learning, which often implies a transition from one regime to the other as soon as $k$ becomes smaller than this time. In particular, we show that using the popular $k$ (persistent) contrastive divergence approaches, with $k$ small, the dynamics of the fitted model are extremely slow and often dominated by strong out-of-equilibrium effects. On the contrary, RBMs trained in equilibrium display much faster dynamics and a smooth convergence to dataset-like configurations during sampling. We then discuss how to exploit both regimes in practice, depending on the task one aims to fulfill: (i) small $k$ can be used to generate convincing samples in short learning times; (ii) large (or increasingly large) $k$ must be used to learn the correct equilibrium distribution of the RBM. Finally, the existence of these two operational regimes seems to be a general property of energy-based models trained via likelihood maximization.
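To make the role of $k$ concrete, the sketch below shows how the number of MCMC steps enters the weight-gradient estimate in contrastive divergence (chains restarted at the data) and in its persistent variant (chains carried over between updates). It is a minimal NumPy illustration assuming binary {0,1} units, not the authors' code. The names `run_chain` and `grad_W` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_chain(v, W, a, b, k, rng):
    # k layer-wise Gibbs sweeps: sample hidden given visible, then visible given hidden.
    for _ in range(k):
        p_h = sigmoid(v @ W + b)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(h @ W.T + a)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v

def grad_W(v_data, W, a, b, k, rng, chains=None):
    # CD-k:  chains=None, so the negative-phase chains start at the data batch.
    # PCD-k: pass the chains returned by the previous call so they persist.
    start = v_data if chains is None else chains
    v_model = run_chain(start, W, a, b, k, rng)
    # Positive phase (data) minus negative phase (k-step samples), using
    # hidden-unit probabilities rather than sampled hidden states.
    pos = v_data.T @ sigmoid(v_data @ W + b) / len(v_data)
    neg = v_model.T @ sigmoid(v_model @ W + b) / len(v_model)
    return pos - neg, v_model
```

A training loop would call `grad_W` once per batch and, for the persistent variant, feed the returned `v_model` back in as `chains`. If $k$ is much smaller than the mixing time, the negative-phase samples are not equilibrium samples of the current model, which is the out-of-equilibrium regime the abstract describes.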