Rister, Blaine, Rubin, Daniel L

Although artificial neural networks have shown great promise in applications including computer vision and speech recognition, there remains considerable practical and theoretical difficulty in optimizing their parameters. The seemingly unreasonable success of gradient descent methods in minimizing these non-convex functions remains poorly understood. In this work we offer some theoretical guarantees for networks with piecewise affine activation functions, which have in recent years become the norm. We prove three main results. First, the network is piecewise convex as a function of the input data. Second, the network, considered as a function of the parameters in a single layer with all others held constant, is again piecewise convex. Third, the network as a function of all its parameters is piecewise multi-convex, a generalization of biconvexity. From here we characterize the local minima and stationary points of the training objective, showing that they minimize the objective on certain subsets of the parameter space. We then analyze the performance of two optimization algorithms on multi-convex problems: gradient descent, and a method which repeatedly solves a number of convex sub-problems. We prove necessary convergence conditions for the first algorithm and both necessary and sufficient conditions for the second, after introducing regularization to the objective. Finally, we remark on the remaining difficulty of the global optimization problem. Under the squared error objective, we show that by varying the training data, a single rectifier neuron admits local minima arbitrarily far apart, both in objective value and parameter space.

Chung, A. G., Shafiee, M. J., Wong, A.

The approximation of nonlinear kernels via linear feature maps has recently gained interest due to its applications in reducing the training and testing time of kernel-based learning algorithms. Current random projection methods avoid the curse of dimensionality by embedding the nonlinear feature space into a low-dimensional Euclidean space to create nonlinear kernels. We introduce a Layered Random Projection (LaRP) framework, where we model the linear kernels and nonlinearity separately for increased training efficiency. The proposed LaRP framework was assessed using the MNIST handwritten digits database and the COIL-100 object database, and showed notable improvement in object classification performance relative to other state-of-the-art random projection methods.

Markov chain Monte Carlo (MCMC) algorithms are simple and extremely powerful techniques to sample from almost arbitrary distributions. The flaw in practice is that it can take a large and/or unknown amount of time to converge to the stationary distribution. This paper gives sufficient conditions to guarantee that univariate Gibbs sampling on Markov Random Fields (MRFs) will be fast mixing, in a precise sense. Further, an algorithm is given to project onto this set of fast-mixing parameters in the Euclidean norm. Following recent work, we give an example use of this to project in various divergence measures, comparing univariate marginals obtained by sampling after projection to common variational methods and Gibbs sampling on the original parameters.
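
As a rough illustration of the univariate Gibbs sampling mentioned above, the following sketch estimates univariate marginals on a tiny Ising-style MRF and compares them with exact marginals obtained by enumeration. It is not the paper's method: the model form, the function names (`gibbs_marginals`, `exact_marginals`), the parameter values, and the sweep counts are all assumptions chosen for illustration, and neither the fast-mixing condition nor the projection step is shown.

```python
import itertools
import math
import random


def gibbs_marginals(field, coupling, sweeps=20000, burn_in=1000, seed=0):
    """Estimate P(x_i = +1) by univariate Gibbs sampling.

    Assumed model (illustrative only):
        p(x) proportional to exp(sum_i field[i]*x_i
                                 + sum_{i<j} coupling[i][j]*x_i*x_j)
    with x_i in {-1, +1} and a symmetric coupling matrix.
    """
    rng = random.Random(seed)
    n = len(field)
    x = [rng.choice((-1, 1)) for _ in range(n)]
    plus_counts = [0] * n
    for sweep in range(burn_in + sweeps):
        for i in range(n):
            # Local field seen by variable i given its neighbours.
            local = field[i] + sum(
                coupling[i][j] * x[j] for j in range(n) if j != i
            )
            # Conditional P(x_i = +1 | rest) is a logistic function of 2*local.
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * local))
            x[i] = 1 if rng.random() < p_plus else -1
        if sweep >= burn_in:
            for i in range(n):
                if x[i] == 1:
                    plus_counts[i] += 1
    return [count / sweeps for count in plus_counts]


def exact_marginals(field, coupling):
    """Compute P(x_i = +1) exactly by enumerating all 2^n states."""
    n = len(field)
    plus_weight = [0.0] * n
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        log_weight = sum(field[i] * x[i] for i in range(n))
        log_weight += sum(
            coupling[i][j] * x[i] * x[j]
            for i in range(n)
            for j in range(i + 1, n)
        )
        weight = math.exp(log_weight)
        total += weight
        for i in range(n):
            if x[i] == 1:
                plus_weight[i] += weight
    return [w / total for w in plus_weight]


# Hypothetical three-node chain with weak couplings.
field = [0.2, -0.1, 0.0]
coupling = [
    [0.0, 0.3, 0.0],
    [0.3, 0.0, 0.3],
    [0.0, 0.3, 0.0],
]
estimated = gibbs_marginals(field, coupling)
exact = exact_marginals(field, coupling)
```

With weak couplings such as these, the sampled marginals should track the exact ones closely after a modest number of sweeps. Stronger couplings can slow convergence considerably, which is the practical difficulty the abstract describes.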

Chamroukhi, Faicel, Samé, Allou, Govaert, Gérard, Aknin, Patrice

A new approach for functional data description is proposed in this paper. It consists of a regression model with a discrete hidden logistic process, which is adapted for modeling curves with abrupt or smooth regime changes. The model parameters are estimated in a maximum likelihood framework through a dedicated Expectation Maximization (EM) algorithm. From the proposed generative model, a curve discrimination rule is derived using the Maximum A Posteriori rule. The proposed model is evaluated using simulated curves and real-world curves acquired during railway switch operations, by comparing it with the piecewise regression approach in terms of curve modeling and classification.