Goto

Collaborating Authors

 Statistical Learning


Softassign versus Softmax: Benchmarks in Combinatorial Optimization

Neural Information Processing Systems

Steven Gold Department of Computer Science Yale University New Haven, CT 06520-8285 AnandRangarajan Dept. of Diagnostic Radiology Yale University New Haven, CT 06520-8042 Abstract A new technique, termed soft assign, is applied for the first time to two classic combinatorial optimization problems, the traveling salesmanproblem and graph partitioning. Softassign, which has emerged from the recurrent neural network/statistical physics framework, enforces two-way (assignment) constraints without the use of penalty terms in the energy functions. The softassign can also be generalized from two-way winner-take-all constraints to multiple membership constraints which are required for graph partitioning. Thesoftassign technique is compared to the softmax (Potts glass). Within the statistical physics framework, softmax and a penalty term has been a widely used method for enforcing the two-way constraints common within many combinatorial optimization problems.The benchmarks present evidence that softassign has clear advantages in accuracy, speed, parallelizabilityand algorithmic simplicityover softmax and a penalty term in optimization problems with two-way constraints. 1 Introduction In a series of papers in the early to mid 1980's, Hopfield and Tank introduced techniques which allowed one to solve combinatorial optimization problems with recurrent neural networks [Hopfield and Tank, 1985].


From Isolation to Cooperation: An Alternative View of a System of Experts

Neural Information Processing Systems

We introduce a constructive, incremental learning system for regression problems that models data by means of locally linear experts. In contrast to other approaches, the experts are trained independently and do not compete for data during learning. Only when a prediction for a query is required do the experts cooperate by blending their individual predictions. Eachexpert is trained by minimizing a penalized local cross validation errorusing second order methods. In this way, an expert is able to find a local distance metric by adjusting the size and shape of the receptive fieldin which its predictions are valid, and also to detect relevant input features by adjusting its bias on the importance of individual input dimensions. We derive asymptotic results for our method. In a variety of simulations the properties of the algorithm are demonstrated with respect to interference, learning speed, prediction accuracy, feature detection, and task oriented incremental learning.


An Information-theoretic Learning Algorithm for Neural Network Classification

Neural Information Processing Systems

A new learning algorithm is developed for the design of statistical classifiers minimizing the rate of misclassification. The method, which is based on ideas from information theory and analogies to statistical physics, assigns data to classes in probability. The distributions arechosen to minimize the expected classification error while simultaneously enforcing the classifier's structure and a level of "randomness" measured by Shannon's entropy. Achievement of the classifier structure is quantified by an associated cost. The constrained optimizationproblem is equivalent to the minimization of a Helmholtz free energy, and the resulting optimization method is a basic extension of the deterministic annealing algorithm that explicitly enforces structural constraints on assignments while reducing theentropy and expected cost with temperature. In the limit of low temperature, the error rate is minimized directly and a hard classifier with the requisite structure is obtained. This learning algorithmcan be used to design a variety of classifier structures. The approach is compared with standard methods for radial basis function design and is demonstrated to substantially outperform other design methods on several benchmark examples, while often retainingdesign complexity comparable to, or only moderately greater than that of strict descent-based methods.


Clustering data through an analogy to the Potts model

Neural Information Processing Systems

A new approach for clustering is proposed. This method is based on an analogy to a physical model; the ferromagnetic Potts model at thermal equilibrium is used as an analog computer for this hard optimization problem. We do not assume any structure of the underlying distributionof the data. Phase space of the Potts model is divided into three regions; ferromagnetic, super-paramagnetic and paramagnetic phases. The region of interest is that corresponding to the super-paramagnetic one, where domains of aligned spins appear.


Discriminant Adaptive Nearest Neighbor Classification and Regression

Neural Information Processing Systems

Nearest neighbor classification expects the class conditional probabilities tobe locally constant, and suffers from bias in high dimensions Wepropose a locally adaptive form of nearest neighbor classification to try to finesse this curse of dimensionality. We use a local linear discriminant analysis to estimate an effective metric forcomputing neighborhoods. We determine the local decision boundaries from centroid information, and then shrink neighborhoods indirections orthogonal to these local decision boundaries, and elongate them parallel to the boundaries. Thereafter, any neighborhood-based classifier can be employed, using the modified neighborhoods. We also propose a method for global dimension reduction, that combines local dimension information.


Family Discovery

Neural Information Processing Systems

"Family discovery" is the task of learning the dimension and structure ofa parameterized family of stochastic models. It is especially appropriatewhen the training examples are partitioned into "episodes" of samples drawn from a single parameter value. We present three family discovery algorithms based on surface learning andshow that they significantly improve performance over two alternatives on a parameterized classification task. 1 INTRODUCTION Human listeners improve their ability to recognize speech by identifying the accent of the speaker. "Might" in an American accent is similar to "mate" in an Australian accent. By first identifying the accent, discrimination between these two words is improved.


Worst-case Loss Bounds for Single Neurons

Neural Information Processing Systems

We analyze and compare the well-known Gradient Descent algorithm anda new algorithm, called the Exponentiated Gradient algorithm, for training a single neuron with an arbitrary transfer function . Both algorithms are easily generalized to larger neural networks, and the generalization of Gradient Descent is the standard back-propagationalgorithm. In this paper we prove worstcase loss bounds for both algorithms in the single neuron case. Since local minima make it difficult to prove worst-case bounds for gradient-based algorithms, we must use a loss function that prevents the formation of spurious local minima. We define such a matching loss function for any strictly increasing differentiable transfer function and prove worst-case loss bound for any such transfer function and its corresponding matching loss. For example, thematching loss for the identity function is the square loss and the matching loss for the logistic sigmoid is the entropic loss. The different structure of the bounds for the two algorithms indicates thatthe new algorithm outperforms Gradient Descent when the inputs contain a large number of irrelevant components.


Dynamics of On-Line Gradient Descent Learning for Multilayer Neural Networks

Neural Information Processing Systems

Sollat CONNECT, The Niels Bohr Institute Blegdamsdvej 17 Copenhagen 2100, Denmark Abstract We consider the problem of online gradient descent learning for general two-layer neural networks. An analytic solution is presented andused to investigate the role of the learning rate in controlling theevolution and convergence of the learning process. Two-layer networks with an arbitrary number of hidden units have been shown to be universal approximators [1] for such N-to-one dimensional maps. We investigate the emergence of generalization ability in an online learning scenario [2], in which the couplings are modified after the presentation of each example so as to minimize the corresponding error. The resulting changes in {J} are described as a dynamical evolution; the number of examples plays the role of time.



Stable Dynamic Parameter Adaption

Neural Information Processing Systems

A stability criterion for dynamic parameter adaptation is given. In the case of the learning rate of backpropagation, a class of stable algorithms is presented and studied, including a convergence proof. 1 INTRODUCTION All but a few learning algorithms employ one or more parameters that control the quality of learning. Backpropagation has its learning rate and momentum parameter; Boltzmannlearning uses a simulated annealing schedule; Kohonen learning a learning rate and a decay parameter; genetic algorithms probabilities, etc. The investigator always has to set the parameters to specific values when trying to solve a certain problem. Traditionally, the metaproblem of adjusting the parameters is solved by relying on a set of well-tested values of other problems or an intensive search for good parameter regions by restarting the experiment with different values. Inthis situation, a great deal of expertise and/or time for experiment design is required (as well as a huge amount of computing time).