Montufar, Guido, Morton, Jason

We describe discrete restricted Boltzmann machines: probabilistic graphical models with bipartite interactions between visible and hidden discrete variables. Examples are binary restricted Boltzmann machines and discrete naive Bayes models. We detail the inference functions and distributed representations arising in these models in terms of configurations of projected products of simplices and normal fans of products of simplices. We bound the number of hidden variables, depending on the cardinalities of their state spaces, for which these models can approximate any probability distribution on their visible states to any given accuracy. In addition, we use algebraic methods and coding theory to compute their dimension.

Merkh, Thomas, Montúfar, Guido

In this chapter we take a look at the universal approximation question for stochastic feedforward neural networks. In contrast to deterministic networks, which represent mappings from a set of inputs to a set of outputs, stochastic networks represent mappings from a set of inputs to a set of probability distributions over the set of outputs. In particular, even if the sets of inputs and outputs are finite, the class of stochastic mappings in question is not finite. Moreover, while for a deterministic function the values of all output variables can be computed independently of each other given the values of the inputs, in the stochastic setting the values of the output variables may need to be correlated, which requires that their values are computed jointly. A prominent class of stochastic feedforward networks which has played a key role in the resurgence of deep learning are deep belief networks. The representational power of these networks has been studied mainly in the generative setting, as models of probability distributions without an input, or in the discriminative setting for the special case of deterministic mappings. We study the representational power of deep sigmoid belief networks in terms of compositions of linear transformations of probability distributions, Markov kernels, that can be expressed by the layers of the network. We investigate different types of shallow and deep architectures, and the minimal number of layers and units per layer that are sufficient and necessary in order for the network to be able to approximate any given stochastic mapping from the set of inputs to the set of outputs arbitrarily well.

We establish upper bounds for the minimal number of hidden units for which a binary stochastic feedforward network with sigmoid activation probabilities and a single hidden layer is a universal approximator of Markov kernels. We show that each possible probabilistic assignment of the states of $n$ output units, given the states of $k\geq1$ input units, can be approximated arbitrarily well by a network with $2^{k-1}(2^{n-1}-1)$ hidden units.

We show that deep narrow Boltzmann machines are universal approximators of probability distributions on the activities of their visible units, provided they have sufficiently many hidden layers, each containing the same number of units as the visible layer. We show that, within certain parameter domains, deep Boltzmann machines can be studied as feedforward networks. We provide upper and lower bounds on the sufficient depth and width of universal approximators. These results settle various intuitions regarding undirected networks and, in particular, they show that deep narrow Boltzmann machines are at least as compact universal approximators as narrow sigmoid belief networks and restricted Boltzmann machines, with respect to the currently available bounds for those models.

We generalize recent theoretical work on the minimal number of layers of narrow deep belief networks that can approximate any probability distribution on the states of their visible units arbitrarily well. We relax the setting of binary units (Sutskever and Hinton, 2008; Le Roux and Bengio, 2008, 2010; Mont\'ufar and Ay, 2011) to units with arbitrary finite state spaces, and the vanishing approximation error to an arbitrary approximation error tolerance. For example, we show that a $q$-ary deep belief network with $L\geq 2+\frac{q^{\lceil m-\delta \rceil}-1}{q-1}$ layers of width $n \leq m + \log_q(m) + 1$ for some $m\in \mathbb{N}$ can approximate any probability distribution on $\{0,1,\ldots,q-1\}^n$ without exceeding a Kullback-Leibler divergence of $\delta$. Our analysis covers discrete restricted Boltzmann machines and na\"ive Bayes models as special cases.