AITopics

A new learning algorithm is developed for the design of statistical classifiers minimizing the rate of misclassification. The method, which is based on ideas from information theory and analogies to statistical physics, assigns data to classes in probability. The distributions are chosen to minimize the expected classification error while simultaneously enforcing the classifier's structure and a level of "randomness" measured by Shannon's entropy. Achievement of the classifier structure is quantified by an associated cost. The constrained optimization problem is equivalent to the minimization of a Helmholtz free energy, and the resulting optimization method is a basic extension of the deterministic annealing algorithm that explicitly enforces structural constraints on assignments while reducing the entropy and expected cost with temperature. In the limit of low temperature, the error rate is minimized directly and a hard classifier with the requisite structure is obtained. This learning algorithm can be used to design a variety of classifier structures. The approach is compared with standard methods for radial basis function design and is demonstrated to substantially outperform other design methods on several benchmark examples, while often retaining design complexity comparable to, or only moderately greater than that of strict descent-based methods.

classifier, information-theoretic learning algorithm, probability, (13 more...)

Country:

North America > United States > California > Santa Barbara County > Santa Barbara (0.14)
North America > United States > Texas (0.04)
North America > United States > Pennsylvania > Centre County > State College (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Waterhouse, Steve R., Robinson, Anthony J.

Constructive Algorithms for Hierarchical Mixtures of Experts

By applying a likelihood splitting criteria to each expert in the HME we "grow" the tree adaptively during training. Secondly, by considering only the most probable path through the tree we may "prune" branches away, either temporarily, or permanently if they become redundant. We demonstrate results for the growing and path pruning algorithms which show significant speed ups and more efficient use of parameters over the standard fixed structure in discriminating between two interlocking spirals and classifying 8-bit parity patterns. INTRODUCTION The HME (Jordan & Jacobs 1994) is a tree structured network whose terminal nodes are simple function approximators in the case of regression or classifiers in the case of classification. The outputs of the terminal nodes or experts are recursively combined upwards towards the root node, to form the overall output of the network, by "gates" which are situated at the non-terminal nodes.

constructive algorithm, hierarchical mixture, node, (15 more...)

Country:

Asia > Middle East > Jordan (0.25)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Learning long-term dependencies is not as difficult with NARX networks

Lin, Tsungnan, Horne, Bill G., Tiño, Peter, Giles, C. Lee

It has recently been shown that gradient descent learning algorithms for recurrent neural networks can perform poorly on tasks that involve long-term dependencies. In this paper we explore this problem for a class of architectures called NARX networks, which have powerful representational capabilities. Previous work reported that gradient descent learning is more effective in NARX networks than in recurrent networks with "hidden states". We show that although NARX networks do not circumvent the problem of long-term dependencies, they can greatly improve performance on such problems. We present some experimental'results that show that NARX networks can often retain information for two to three times as long as conventional recurrent networks.

dependency, long-term dependency, narx network, (15 more...)

Country:

North America > United States > New Jersey > Mercer County > Princeton (0.14)
North America > United States > Maryland > Prince George's County > College Park (0.14)
North America > United States > Florida > Orange County > Orlando (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)

Walter, Jörg A., Ritter, Helge

Investment Learning with Hierarchical PSOMs

We propose a hierarchical scheme for rapid learning of context dependent "skills" that is based on the recently introduced "Parameterized Self Organizing Map" ("PSOM"). The underlying idea is to first invest some learning effort to specialize the system into a rapid learner for a more restricted range of contexts. The specialization is carried out by a prior "investment learning stage", during which the system acquires a set of basis mappings or "skills" for a set of prototypical contexts. Adaptation of a "skill" to a new context can then be achieved by interpolating in the space of the basis mappings and thus can be extremely rapid. We demonstrate the potential of this approach for the task of a 3D visuomotor map for a Puma robot and two cameras. This includes the forward and backward robot kinematics in 3D end effector coordinates, the 2D 2D retina coordinates and also the 6D joint angles. After the investment phase the transformation can be learned for a new camera setup with a single observation.

mapping, meta-psom, t-psom, (14 more...)

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > Germany (0.04)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Schraudolph, Nicol N., Sejnowski, Terrence J.

Tempering Backpropagation Networks: Not All Weights are Created Equal

Backpropagation learning algorithms typically collapse the network's structure into a single vector of weight parameters to be optimized. We suggest that their performance may be improved by utilizing the structural information instead of discarding it, and introduce a framework for ''tempering'' each weight accordingly. In the tempering model, activation and error signals are treated as approximately independent random variables. The characteristic scale of weight changes is then matched to that ofthe residuals, allowing structural properties such as a node's fan-in and fan-out to affect the local learning rate and backpropagated error. The model also permits calculation of an upper bound on the global learning rate for batch updates, which in turn leads to different update rules for bias vs. non-bias weights. This approach yields hitherto unparalleled performance on the family relations benchmark, a deep multi-layer network: for both batch learning with momentum and the delta-bar-delta algorithm, convergence at the optimal learning rate is sped up by more than an order of magnitude.

global learning rate, learning rate, tempering backpropagation network, (14 more...)

Country:

North America > United States > Colorado > Denver County > Denver (0.05)
North America > United States > California > San Diego County > San Diego (0.05)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(3 more...)

Industry: Education (0.36)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Backpropagation (0.65)

The Capacity of a Bump

Flake, Gary William

Recently, several researchers have reported encouraging experimental results when using Gaussian or bump-like activation functions in multilayer perceptrons. Networks of this type usually require fewer hidden layers and units and often learn much faster than typical sigmoidal networks. To explain these results we consider a hyper-ridge network, which is a simple perceptron with no hidden units and a rid¥e activation function. If we are interested in partitioningp points in d dimensions into two classes then in the limit as d approaches infinity the capacity of a hyper-ridge and a perceptron is identical.

dichotomy, perceptron, simple perceptron, (15 more...)

Country:

North America > United States > Maryland > Prince George's County > College Park (0.15)
North America > United States > New York (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (1.00)

Rebotier, Thomas P., Elman, Jeffrey L.

Explorations with the Dynamic Wave Model

Following Shrager and Johnson (1995) we study growth of logical function complexity in a network swept by two overlapping waves: one of pruning, and the other of Hebbian reinforcement of connections. Results indicate a significant spatial gradient in the appearance of both linearly separable and non linearly separable functions of the two inputs of the network; the n.l.s. cells are much sparser and their slope of appearance is sensitive to parameters in a highly nonlinear way.

exploration, logical function, probability, (13 more...)

Country: North America > United States > California > San Diego County > La Jolla (0.05)

Technology: Information Technology > Artificial Intelligence (0.71)

Ormoneit, Dirk, Tresp, Volker

Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging

We compare two regularization methods which can be used to improve the generalization capabilities of Gaussian mixture density estimates. The first method uses a Bayesian prior on the parameter space. We derive EM (Expectation Maximization) update rules which maximize the a posterior parameter probability. In the second approach we apply ensemble averaging to density estimation. This includes Breiman's "bagging", which recently has been found to produce impressive results for classification networks.

improved gaussian mixture density estimate, parameter estimate, regularization, (9 more...)

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.05)
North America > United States > California (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.70)

Opitz, David W., Shavlik, Jude W.

Generating Accurate and Diverse Members of a Neural-Network Ensemble

In particular, combining separately trained neural networks (commonly referred to as a neural-network ensemble) has been demonstrated to be particularly successful (Alpaydin, 1993; Drucker et al., 1994; Hansen and Salamon, 1990; Hashem et al., 1994; Krogh and Vedelsby, 1995; Maclin and Shavlik, 1995; Perrone, 1992). Both theoretical (Hansen and Salamon, 1990; Krogh and Vedelsby, 1995) and empirical (Hashem et al., 1994; 536 D. W. OPITZ, J. W. SHA VLIK Maclin and Shavlik, 1995) work has shown that a good ensemble is one where the individual networks are both accurate and make their errors on different parts of the input space; however, most previous work has either focussed on combining the output of multiple trained networks or only indirectly addressed how we should generate a good set of networks.

algorithm, ensemble, neural network, (13 more...)

Country:

North America > United States > Wisconsin > Dane County > Madison (0.14)
North America > United States > New Jersey > Middlesex County > New Brunswick (0.05)
North America > United States > Minnesota > St. Louis County > Duluth (0.04)
(6 more...)

Genre: Research Report (0.69)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Jaakkola, Tommi, Saul, Lawrence K., Jordan, Michael I.

Fast Learning by Bounding Likelihoods in Sigmoid Type Belief Networks

Sigmoid type belief networks, a class of probabilistic neural networks, provide a natural framework for compactly representing probabilistic information in a variety of unsupervised and supervised learning problems. Often the parameters used in these networks need to be learned from examples. Unfortunately, estimating the parameters via exact probabilistic calculations (i.e, the EMalgorithm) is intractable even for networks with fairly small numbers of hidden units. We propose to avoid the infeasibility of the E step by bounding likelihoods instead of computing them exactly. We introduce extended and complementary representations for these networks and show that the estimation of the network parameters can be made fast (reduced to quadratic optimization) by performing the estimation in either of the alternative domains.

bounding likelihood, probability, representation, (14 more...)

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.15)
Asia > Middle East > Jordan (0.06)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)