Goto

Collaborating Authors

 Perceptrons


Differentiating Functions of the Jacobian with Respect to the Weights

Neural Information Processing Systems

For many problems, the correct behavior of a model depends not only on its input-output mapping but also on properties of its Jacobian matrix, the matrix of partial derivatives of the model's outputs with respect to its in(cid:173) puts. We introduce the J-prop algorithm, an efficient general method for computing the exact partial derivatives of a variety of simple functions of the Jacobian of a model with respect to its free parameters. The algorithm applies to any parametrized feedforward model, including nonlinear re(cid:173) gression, multilayer perceptrons, and radial basis function networks.


The Relaxed Online Maximum Margin Algorithm

Neural Information Processing Systems

We describe a new incremental algorithm for training linear thresh(cid:173) old functions: the Relaxed Online Maximum Margin Algorithm, or ROMMA. ROMMA can be viewed as an approximation to the algorithm that repeatedly chooses the hyperplane that classifies previously seen ex(cid:173) amples correctly with the maximum margin. It is known that such a maximum-margin hypothesis can be computed by minimizing the length of the weight vector subject to a number of linear constraints. ROMMA works by maintaining a relatively simple relaxation of these constraints that can be efficiently updated. We prove a mistake bound for ROMMA that is the same as that proved for the perceptron algorithm.


Efficient Learning of Linear Perceptrons

Neural Information Processing Systems

We consider the existence of efficient algorithms for learning the class of half-spaces in n in the agnostic learning model (Le., mak(cid:173) ing no prior assumptions on the example-generating distribution). The resulting combinatorial problem - finding the best agreement half-space over an input sample - is NP hard to approximate to within some constant factor. We suggest a way to circumvent this theoretical bound by introducing a new measure of success for such algorithms. An algorithm is IL-margin successful if the agreement ratio of the half-space it outputs is as good as that of any half-space once training points that are inside the IL-margins of its separating hyper-plane are disregarded. We prove crisp computational com(cid:173) plexity results with respect to this success measure: On one hand, for every positive IL, there exist efficient (poly-time) IL-margin suc(cid:173) cessful learning algorithms.


Convolution Kernels for Natural Language

Neural Information Processing Systems

We describe the application of kernel methods to Natural Language Pro- cessing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional rep- resentations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.


Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons

Neural Information Processing Systems

Singularities are ubiquitous in the parameter space of hierarchical models such as multilayer perceptrons. At singularities, the Fisher information matrix degenerates, and the Cramer-Rao paradigm does no more hold, implying that the classical model selection the(cid:173) ory such as AIC and MDL cannot be applied. It is important to study the relation between the generalization error and the training error at singularities. The present paper demonstrates a method of analyzing these errors both for the maximum likelihood estima(cid:173) tor and the Bayesian predictive distribution in terms of Gaussian random fields, by using simple models.


Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

Neural Information Processing Systems

We study online learning in Boolean domains using kernels which cap- ture feature expansions equivalent to using conjunctions over basic fea- tures. We demonstrate a tradeoff between the computational efficiency with which these kernels can be computed and the generalization abil- ity of the resulting classifier. We first describe several kernel functions which capture either limited forms of conjunctions or all conjunctions. We show that these kernels can be used to efficiently run the Percep- tron algorithm over an exponential number of conjunctions; however we also prove that using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple func- tions. We also consider an analogous use of kernel functions to run the multiplicative-update Winnow algorithm over an expanded feature space of exponentially many conjunctions.


Semi-Definite Programming by Perceptron Learning

Neural Information Processing Systems

We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasable region) results in a prob- abilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algo- rithm works, but is not competitive with state-of-the-art interior point methods.


Subject-Independent Magnetoencephalographic Source Localization by a Multilayer Perceptron

Neural Information Processing Systems

We describe a system that localizes a single dipole to reasonable accu- racy from noisy magnetoencephalographic (MEG) measurements in real time. At its core is a multilayer perceptron (MLP) trained to map sen- sor signals and head position to dipole location. Including head position overcomes the previous need to retrain the MLP for each subject and ses- sion. The training dataset was generated by mapping randomly chosen dipoles and head positions through an analytic model and adding noise from real MEG recordings. After training, a localization took 0.7 ms with an average error of 0.90 cm.


Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian

Neural Information Processing Systems

The online incremental gradient (or backpropagation) algorithm is widely considered to be the fastest method for solving large-scale neural-network (NN) learning problems. In contrast, we show that an appropriately implemented iterative batch-mode (or block-mode) learning method can be much faster. For example, it is three times faster in the UCI letter classification problem (26 outputs, 16,000 data items, 6,066 parameters with a two-hidden-layer multilayer perceptron) and 353 times faster in a nonlinear regression problem arising in color recipe prediction (10 outputs, 1,000 data items, 2,210 parameters with a neuro-fuzzy modular network). The three principal innovative ingredients in our algorithm are the following: First, we use scaled trust-region regularization with inner-outer it- eration to solve the associated "overdetermined" nonlinear least squares problem, where the inner iteration performs a truncated (or inexact) Newton method. Second, we employ Pearlmutter's implicit sparse Hessian matrix-vector multiply algorithm to con- struct the Krylov subspaces used to solve for the truncated New- ton update. Third, we exploit sparsity (for preconditioning) in the matrices resulting from the NNs having many outputs.


The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

Neural Information Processing Systems

The Perceptron algorithm, despite its simplicity, often performs well on online classification tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a fixed memory budget. To our knowledge, this is the first online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound.