Goto

Collaborating Authors

 Technology


Learning Many Related Tasks at the Same Time with Backpropagation

Neural Information Processing Systems

Hinton [6] proposed that generalization in artificial neural nets should improve if nets learn to represent the domain's underlying regularities. Abu-Mustafa's hints work [1] shows that the outputs of a backprop net can be used as inputs through which domainspecific information can be given to the net. We extend these ideas by showing that a backprop net learning many related tasks at the same time can use these tasks as inductive bias for each other and thus learn better. We identify five mechanisms by which multitask backprop improves generalization and give empirical evidence that multi task backprop generalizes better in real domains.


Effects of Noise on Convergence and Generalization in Recurrent Networks

Neural Information Processing Systems

We introduce and study methods of inserting synaptic noise into dynamically-driven recurrent neural networks and show that applying a controlled amount of noise during training may improve convergence and generalization. In addition, we analyze the effects of each noise parameter (additive vs. multiplicative, cumulative vs. non-cumulative, per time step vs. per string) and predict that best overall performance can be achieved by injecting additive noise at each time step. Extensive simulations on learning the dual parity grammar from temporal strings substantiate these predictions.


Estimating Conditional Probability Densities for Periodic Variables

Neural Information Processing Systems

Many applications of neural networks can be formulated in terms of a multivariate nonlinear mapping from an input vector x to a target vector t. A conventional neural network approach, based on least squares for example, leads to a network mapping which approximates the regression of t on x. A more complete description of the data can be obtained by estimating the conditional probability density of t, conditioned on x, which we write as p(tlx). Various techniques exist for modelling such densities when the target variables live in a Euclidean space. However, a number of potential applications involve angle-like output variables which are periodic on some finite interval (usually chosen to be (0,271")).


An Alternative Model for Mixtures of Experts

Neural Information Processing Systems

We propose an alternative model for mixtures of experts which uses a different parametric form for the gating network. The modified model is trained by the EM algorithm. In comparison with earlier models-trained by either EM or gradient ascent-there is no need to select a learning stepsize. We report simulation experiments which show that the new architecture yields faster convergence. We also apply the new model to two problem domains: piecewise nonlinear function approximation and the combination of multiple previously trained classifiers. 1 INTRODUCTION For the mixtures of experts architecture (Jacobs, Jordan, Nowlan & Hinton, 1991), the EM algorithm decouples the learning process in a manner that fits well with the modular structure and yields a considerably improved rate of convergence (Jordan & Jacobs, 1994).


A Growing Neural Gas Network Learns Topologies

Neural Information Processing Systems

An incremental network model is introduced which is able to learn the important topological relations in a given set of input vectors by means of a simple Hebb-like learning rule. In contrast to previous approaches like the "neural gas" method of Martinetz and Schulten (1991, 1994), this model has no parameters which change over time and is able to continue learning, adding units and connections, until a performance criterion has been met. Applications of the model include vector quantization, clustering, and interpolation.


Factorial Learning and the EM Algorithm

Neural Information Processing Systems

Many real world learning problems are best characterized by an interaction of multiple independent causes or factors. Discovering such causal structure from the data is the focus of this paper. Based on Zemel and Hinton's cooperative vector quantizer (CVQ) architecture, an unsupervised learning algorithm is derived from the Expectation-Maximization (EM) framework. Due to the combinatorial nature of the data generation process, the exact E-step is computationally intractable. Two alternative methods for computing the E-step are proposed: Gibbs sampling and mean-field approximation, and some promising empirical results are presented.


Active Learning for Function Approximation

Neural Information Processing Systems

We develop a principled strategy to sample a function optimally for function approximation tasks within a Bayesian framework. Using ideas from optimal experiment design, we introduce an objective function (incorporating both bias and variance) to measure the degree of approximation, and the potential utility of the data points towards optimizing this objective. We show how the general strategy can be used to derive precise algorithms to select data for two cases: learning unit step functions and polynomial functions. In particular, we investigate whether such active algorithms can learn the target with fewer examples. We obtain theoretical and empirical results to suggest that this is the case. 1 INTRODUCTION AND MOTIVATION Learning from examples is a common supervised learning paradigm that hypothesizes a target concept given a stream of training examples that describes the concept. In function approximation, example-based learning can be formulated as synthesizing an approximation function for data sampled from an unknown target function (Poggio and Girosi, 1990). Active learning describes a class of example-based learning paradigms that seeks out new training examples from specific regions of the input space, instead of passively accepting examples from some data generating source.


Convergence Properties of the K-Means Algorithms

Neural Information Processing Systems

K-Means is a popular clustering algorithm used in many applications, including the initialization of more computationally expensive algorithms (Gaussian mixtures, Radial Basis Functions, Learning Vector Quantization and some Hidden Markov Models). The practice of this initialization procedure often gives the frustrating feeling that K-Means performs most of the task in a small fraction of the overall time. This motivated us to better understand this convergence speed. A second reason lies in the traditional debate between hard threshold (e.g.


SARDNET: A Self-Organizing Feature Map for Sequences

Neural Information Processing Systems

A self-organizing neural network for sequence classification called SARDNET is described and analyzed experimentally. SARDNET extends the Kohonen Feature Map architecture with activation retention and decay in order to create unique distributed response patterns for different sequences.


Interior Point Implementations of Alternating Minimization Training

Neural Information Processing Systems

AM techniques were first introduced in soft-competitive learning algorithms[l]. This training procedure was later shown to be closely related to Expectation-Maximization algorithms used by the statistical estimation community[2]. Alternating minimizations search for optimal network weights by breaking the search into two distinct minimization problems. A given network performance functional is extremalized first with respect to one set of network weights and then with respect to the remaining weights. These learning procedures have found applications in the training of local expert systems [3], and in Boltzmann machine training [4]. More recently, convergence rates have been derived by viewing the AM 570 Michael Lemmon.