Perceptrons
Differentiating Functions of the Jacobian with Respect to the Weights
Flake, Gary William, Pearlmutter, Barak A.
For many problems, the correct behavior of a model depends not only on its input-output mapping but also on properties of its Jacobian matrix, the matrix of partial derivatives of the model's outputs with respect to its inputs. We introduce the J-prop algorithm, an efficient general method for computing the exact partial derivatives of a variety of simple functions of the Jacobian of a model with respect to its free parameters. The algorithm applies to any parametrized feedforward model, including nonlinear regression, multilayer perceptrons, and radial basis function networks.
The Relaxed Online Maximum Margin Algorithm
We describe a new incremental algorithm for training linear threshold functions:the Relaxed Online Maximum Margin Algorithm, or ROMMA. ROMMA can be viewed as an approximation to the algorithm that repeatedly chooses the hyperplane that classifies previously seen examples correctlywith the maximum margin. It is known that such a maximum-margin hypothesis can be computed by minimizing the length of the weight vector subject to a number of linear constraints. ROMMA works by maintaining a relatively simple relaxation of these constraints that can be efficiently updated. We prove a mistake bound for ROMMA that is the same as that proved for the perceptron algorithm. Our analysis implies that the more computationally intensive maximum-margin algorithm alsosatisfies this mistake bound; this is the first worst-case performance guaranteefor this algorithm. We describe some experiments using ROMMA and a variant that updates its hypothesis more aggressively as batch algorithms to recognize handwritten digits. The computational complexity and simplicity of these algorithms is similar to that of perceptron algorithm,but their generalization is much better. We describe a sense in which the performance of ROMMA converges to that of SVM in the limit if bias isn't considered.
Reconstruction of Sequential Data with Probabilistic Models and Continuity Constraints
We consider the problem of reconstructing a temporal discrete sequence of multidimensional real vectors when part of the data is missing, under the assumption that the sequence was generated by a continuous process. Aparticular case of this problem is multivariate regression, which is very difficult when the underlying mapping is one-to-many. We propose analgorithm based on a joint probability model of the variables of interest, implemented using a nonlinear latent variable model. Each point in the sequence is potentially reconstructed as any of the modes of the conditional distribution of the missing variables given the present variables (computed using an exhaustive mode search in a Gaussian mixture). Modeselection is determined by a dynamic programming search that minimises a geometric measure of the reconstructed sequence, derived fromcontinuity constraints. We illustrate the algorithm with a toy example and apply it to a real-world inverse problem, the acoustic-toarticulatory mapping.The results show that the algorithm outperforms conditional mean imputation and multilayer perceptrons. 1 Definition of the problem
Linear Hinge Loss and Average Margin
Gentile, Claudio, Warmuth, Manfred K. K.
We describe a unifying method for proving relative loss bounds for online linear threshold classification algorithms, such as the Perceptron and the Winnow algorithms. For classification problems the discrete loss is used, i.e., the total number of prediction mistakes. We introduce a continuous loss function, called the "linear hinge loss", that can be employed to derive the updates of the algorithms. We first prove bounds w.r.t. the linear hinge loss and then convert them to the discrete loss. We introduce a notion of "average margin" of a set of examples. We show how relative loss bounds based on the linear hinge loss can be converted to relative loss bounds i.t.o. the discrete loss using the average margin.
Linear Hinge Loss and Average Margin
Gentile, Claudio, Warmuth, Manfred K.
We describe a unifying method for proving relative loss bounds for online linearthreshold classification algorithms, such as the Perceptron and the Winnow algorithms. For classification problems the discrete loss is used, i.e., the total number of prediction mistakes. We introduce a continuous lossfunction, called the "linear hinge loss", that can be employed to derive the updates of the algorithms. We first prove bounds w.r.t. the linear hinge loss and then convert them to the discrete loss. We introduce anotion of "average margin" of a set of examples . We show how relative loss bounds based on the linear hinge loss can be converted to relative loss bounds i.t.o. the discrete loss using the average margin.
Regularisation in Sequential Learning Algorithms
Freitas, João F. G. de, Niranjan, Mahesan, Gee, Andrew H.
In this paper, we discuss regularisation in online/sequential learning algorithms. In environments where data arrives sequentially, techniques such as cross-validation to achieve regularisation or model selection are not possible. Further, bootstrapping to determine a confidence level is not practical. To surmount these problems, a minimum variance estimation approach that makes use of the extended Kalman algorithm for training multi-layer perceptrons is employed. The novel contribution of this paper is to show the theoretical links between extended Kalman filtering, Sutton's variable learning rate algorithms and Mackay's Bayesian estimation framework. In doing so, we propose algorithms to overcome the need for heuristic choices of the initial conditions and noise covariance matrices in the Kalman approach.
The Efficiency and the Robustness of Natural Gradient Descent Learning Rule
Yang, Howard Hua, Amari, Shun-ichi
The inverse of the Fisher information matrix is used in the natural gradient descent algorithm to train single-layer and multi-layer perceptrons. We have discovered a new scheme to represent the Fisher information matrix of a stochastic multi-layer perceptron. Based on this scheme, we have designed an algorithm to compute the natural gradient. When the input dimension n is much larger than the number of hidden neurons, the complexity of this algorithm is of order O(n). It is confirmed by simulations that the natural gradient descent learning rule is not only efficient but also robust.
Data-Dependent Structural Risk Minimization for Perceptron Decision Trees
Shawe-Taylor, John, Cristianini, Nello
This paper presents a neural-model of pre-attentive visual processing. The model explains why certain displays can be processed very fast, "in parallel", while others require slower, "serial" processing, in subsequent attentional systems. Our approach stems from the observation that the visual environment is overflowing with diverse information, but the biological information-processing systems analyzing it have a limited capacity [1]. This apparent mismatch suggests that data compression should be performed at an early stage of perception, and that via an accompanying process of dimension reduction, only a few essential features of the visual display should be retained. We propose that only parallel displays incorporate global features that enable fast target detection, and hence they can be processed pre-attentively, with all items (target and dis tractors) examined at once.