 sigmoid network


Hamiltonian Monte Carlo on ReLU Neural Networks is Inefficient

arXiv.org Machine Learning

We analyze the error rates of the Hamiltonian Monte Carlo algorithm with leapfrog integrator for Bayesian neural network inference. We show that due to the non-differentiability of activation functions in the ReLU family, leapfrog HMC for networks with these activation functions has a large local error rate of $\Omega(\epsilon)$ rather than the classical error rate of $O(\epsilon^3)$. This leads to a higher rejection rate of the proposals, making the method inefficient. We then verify our theoretical findings through empirical simulations as well as experiments on a real-world dataset that highlight the inefficiency of HMC inference on ReLU-based neural networks compared to analytical networks.
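The contrast between the two local error rates can be illustrated with a minimal sketch of a single leapfrog step. The two one-dimensional potentials below are assumptions chosen for illustration and are not taken from the paper: `smooth_U` is a quadratic, and `kink_U` has a ReLU-shaped kink at zero. All function names are hypothetical.

```python
def leapfrog(q, p, grad_U, eps):
    """One leapfrog step for H(q, p) = U(q) + p^2 / 2."""
    p_half = p - 0.5 * eps * grad_U(q)          # half step in momentum
    q_new = q + eps * p_half                    # full step in position
    p_new = p_half - 0.5 * eps * grad_U(q_new)  # second half step in momentum
    return q_new, p_new


def energy_error(U, grad_U, q, p, eps):
    """Absolute change in the Hamiltonian after one leapfrog step."""
    q_new, p_new = leapfrog(q, p, grad_U, eps)
    h_old = U(q) + 0.5 * p ** 2
    h_new = U(q_new) + 0.5 * p_new ** 2
    return abs(h_new - h_old)


# Toy smooth potential (assumption): U(q) = q^2 / 2.
def smooth_U(q):
    return 0.5 * q ** 2


def smooth_grad(q):
    return q


# Toy non-smooth potential (assumption): U(q) = max(q, 0), with a kink at 0.
def kink_U(q):
    return max(q, 0.0)


def kink_grad(q):
    return 1.0 if q > 0 else 0.0


if __name__ == "__main__":
    for eps in (0.1, 0.05):
        print(eps,
              energy_error(smooth_U, smooth_grad, 1.0, 1.0, eps),
              energy_error(kink_U, kink_grad, -0.01, 1.0, eps))
```

For the smooth potential, halving the step size shrinks the one-step energy error by roughly a factor of eight, consistent with a cubic local error. For the kinked potential, a step that crosses the kink incurs a much larger error, because the gradient used in the first half step is wrong for most of the step.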


Composite Optimization Algorithms for Sigmoid Networks

arXiv.org Artificial Intelligence

In this paper, we use composite optimization algorithms to train sigmoid networks. We reformulate sigmoid network training as an equivalent convex composite optimization problem and propose composite optimization algorithms based on the linearized proximal algorithms and the alternating direction method of multipliers. Under the assumptions of weak sharp minima and a regularity condition, the algorithms are guaranteed to converge to a globally optimal solution of the objective function even in the case of non-convex and non-smooth problems. Furthermore, the convergence results can be directly related to the amount of training data and provide a general guide for setting the size of sigmoid networks. Numerical experiments on Franke's function fitting and handwritten digit recognition show that the proposed algorithms perform satisfactorily and robustly.


Associative Memory in Iterated Overparameterized Sigmoid Autoencoders

arXiv.org Machine Learning

Recent work showed that overparameterized autoencoders can be trained to implement associative memory via iterative maps, when the trained input-output Jacobian of the network has all of its eigenvalue norms strictly below one. Here, we theoretically analyze this phenomenon for sigmoid networks by leveraging recent developments in deep learning theory, especially the correspondence between training neural networks in the infinite-width limit and performing kernel regression with the Neural Tangent Kernel (NTK). We find that overparameterized sigmoid autoencoders can have attractors in the NTK limit for both training with a single example and multiple examples under certain conditions. In particular, for multiple training examples, we find that the norm of the largest Jacobian eigenvalue drops below one with increasing input norm, leading to associative memory.
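The attractor criterion (all Jacobian eigenvalue norms below one at a fixed point of the iterated map) can be checked numerically. The sketch below is a toy under stated assumptions: it uses a fixed, untrained two-unit sigmoid map with hand-picked weights `W` and `b` rather than a trained autoencoder or the NTK limit, and every name in it is hypothetical.

```python
import cmath
import math

# Hand-picked weights and biases for a 2-unit sigmoid map (assumption).
W = [[1.0, -0.5], [0.5, 1.0]]
b = [0.1, -0.2]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def f(x):
    """The map x -> sigmoid(W x + b)."""
    return [sigmoid(W[i][0] * x[0] + W[i][1] * x[1] + b[i]) for i in range(2)]


def jacobian(x):
    """Input-output Jacobian of f at x: diag(s * (1 - s)) @ W."""
    s = f(x)
    return [[s[i] * (1.0 - s[i]) * W[i][j] for j in range(2)] for i in range(2)]


def spectral_radius(J):
    """Largest eigenvalue modulus of a 2x2 matrix via the quadratic formula."""
    tr = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return max(abs((tr + disc) / 2.0), abs((tr - disc) / 2.0))


def iterate(x, steps=100):
    """Apply f repeatedly, as in iterated retrieval."""
    for _ in range(steps):
        x = f(x)
    return x


x_star = iterate([0.9, 0.1])              # approximate fixed point
rho = spectral_radius(jacobian(x_star))   # below one means locally attracting

if __name__ == "__main__":
    print("fixed point:", x_star)
    print("spectral radius:", rho)
    print("from another start:", iterate([0.0, 1.0]))
```

Because the sigmoid derivative is at most 1/4 and the weights here are small, this particular map is a contraction, so every starting point converges to the same fixed point. A trained autoencoder would instead be expected to have several such attractors, one near each stored example.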


Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures

arXiv.org Machine Learning

Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and unfold the inference iterations as layers in a deep network. Rather than optimizing the original model, we untie the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.



A Variational Mean-Field Theory for Sigmoidal Belief Networks

Neural Information Processing Systems

In this paper we discuss a variational mean-field theory and its application to belief networks (BNs), sigmoidal BNs in particular. We present a variational derivation of the mean-field theory proposed by Plefka [2].


For Valid Generalization the Size of the Weights is More Important than the Size of the Network

Neural Information Processing Systems

Baum and Haussler [4] used these results to give sample size bounds for multi-layer threshold networks that grow at least as quickly as the number of weights (see also [7]). However, for pattern classification applications the VC-bounds seem loose; neural networks often perform successfully with training sets that are considerably smaller than the number of weights. This paper shows that for classification problems on which neural networks perform well, if the weights are not too big, the size of the weights determines the generalization performance. In contrast with the function classes and algorithms considered in the VC-theory, neural networks used for binary classification problems have real-valued outputs, and learning algorithms typically attempt to minimize the squared error of the network output over a training set. As well as encouraging the correct classification, this tends to push the output away from zero and towards the target values of {-1, 1}.

