Statistical Learning
Fool's Gold: Extracting Finite State Machines from Recurrent Network Dynamics
Several recurrent networks have been proposed as representations for the task of formal language learning. After training a recurrent network recognize aformal language or predict the next symbol of a sequence, the next logical step is to understand the information processing carried out by the network. Some researchers have begun to extracting finite state machines from the internal state trajectories of their recurrent networks. This paper describes how sensitivity to initial conditions and discrete measurements can trick these extraction methods to return illusory finite state descriptions.
Solvable Models of Artificial Neural Networks
Solvable models of nonlinear learning machines are proposed, and learning in artificial neural networks is studied based on the theory of ordinary differential equations. A learning algorithm is constructed, bywhich the optimal parameter can be found without any recursive procedure. The solvable models enable us to analyze the reason why experimental results by the error backpropagation often contradict the statistical learning theory.
Supervised Learning with Growing Cell Structures
Center positions are continuously updated through soft competitive learning. The width of the radial basis functions is derived from the distance to topological neighbors. During the training the observed error is accumulated locally and used to determine where to insert the next unit. This leads (in case of classification problems) to the placement of units near class borders rather than near frequency peaks as is done by most existing methods. The resulting networks need few training epochs and seem to generalize very well. This is demonstrated by examples.
Adaptive knot Placement for Nonparametric Regression
Najafi, Hossein L., Cherkassky, Vladimir
We show how an "Elman" network architecture, constructed from recurrently connected oscillatory associative memory network modules, canemploy selective "attentional" control of synchronization to direct the flow of communication and computation within the architecture to solve a grammatical inference problem. Previously we have shown how the discrete time "Elman" network algorithm can be implemented in a network completely described by continuous ordinary differential equations. The time steps (machine cycles)of the system are implemented by rhythmic variation (clocking) of a bifurcation parameter. In this architecture, oscillation amplitudecodes the information content or activity of a module (unit), whereas phase and frequency are used to "softwire" the network. Only synchronized modules communicate by exchanging amplitudeinformation; the activity of non-resonating modules contributes incoherent crosstalk noise. Attentional control is modeled as a special subset of the hidden modules with ouputs which affect the resonant frequencies of other hidden modules. They control synchrony among the other modules anddirect the flow of computation (attention) to effect transitions betweentwo subgraphs of a thirteen state automaton which the system emulates to generate a Reber grammar. The internal crosstalk noise is used to drive the required random transitions of the automaton.
Combined Neural Networks for Time Series Analysis
We propose a method for improving the performance of any network designedto predict the next value of a time series. Vve advocate analyzing the deviations of the network's predictions from the data in the training set. This can be carried out by a secondary network trainedon the time series of these residuals. The combined system of the two networks is viewed as the new predictor. We demonstrate the simplicity and success of this method, by applying itto the sunspots data. The small corrections of the secondary network can be regarded as resulting from a Taylor expansion of a complex network which includes the combined system.
A Comparison of Dynamic Reposing and Tangent Distance for Drug Activity Prediction
Dietterich, Thomas G., Jain, Ajay N., Lathrop, Richard H., Lozano-Pérez, Tomás
Thomas G. Dietterich Arris Pharmaceutical Corporation and Oregon State University Corvallis, OR 97331-3202 Ajay N. Jain Arris Pharmaceutical Corporation 385 Oyster Point Blvd., Suite 3 South San Francisco, CA 94080 Richard H. Lathrop and Tomas Lozano-Perez Arris Pharmaceutical Corporation and MIT Artificial Intelligence Laboratory 545 Technology Square Cambridge, MA 02139 Abstract In drug activity prediction (as in handwritten character recognition), thefeatures extracted to describe a training example depend on the pose (location, orientation, etc.) of the example. In handwritten characterrecognition, one of the best techniques for addressing thisproblem is the tangent distance method of Simard, LeCun and Denker (1993). Jain, et al. (1993a; 1993b) introduce a new technique-dynamic reposing-that also addresses this problem. Dynamicreposing iteratively learns a neural network and then reposes the examples in an effort to maximize the predicted output values.New models are trained and new poses computed until models and poses converge. This paper compares dynamic reposing to the tangent distance method on the task of predicting the biological activityof musk compounds.
Robust Parameter Estimation and Model Selection for Neural Network Regression
In this paper, it is shown that the conventional back-propagation (BPP) algorithm for neural network regression is robust to leverages (datawith:n corrupted), but not to outliers (data with y corrupted). A robust model is to model the error as a mixture of normal distribution. The influence function for this mixture model is calculated and the condition for the model to be robust to outliers is given. EM algorithm [5] is used to estimate the parameter. The usefulness of model selection criteria is also discussed.
Locally Adaptive Nearest Neighbor Algorithms
Wettschereck, Dietrich, Dietterich, Thomas G.
Four versions of a k-nearest neighbor algorithm with locally adaptive kare introduced and compared to the basic k-nearest neighbor algorithm (kNN). Locally adaptive kNN algorithms choose the value of k that should be used to classify a query by consulting the results of cross-validation computations in the local neighborhood of the query. Local kNN methods are shown to perform similar to kNN in experiments with twelve commonly used data sets. Encouraging resultsin three constructed tasks show that local methods can significantly outperform kNN in specific applications. Local methods can be recommended for online learning and for applications wheredifferent regions of the input space are covered by patterns solving different sub-tasks.
Assessing the Quality of Learned Local Models
Schaal, Stefan, Atkeson, Christopher G.
An approach is presented to learning high dimensional functions in the case where the learning algorithm can affect the generation of new data. A local modeling algorithm, locally weighted regression, is used to represent the learned function. Architectural parameters of the approach, such as distance metrics, are also localized and become a function of the query point instead of being global. Statistical tests are given for when a local model is good enough and sampling should be moved to a new area. Our methods explicitly deal with the case where prediction accuracy requirements exist during exploration: By gradually shifting a "center of exploration" and controlling the speed of the shift with local prediction accuracy,a goal-directed exploration of state space takes place along the fringes of the current data support until the task goal is achieved.
Unsupervised Parallel Feature Extraction from First Principles
EE., Linkoping University S-58183 Linkoping Sweden Abstract We describe a number of learning rules that can be used to train unsupervised parallelfeature extraction systems. The learning rules are derived using gradient ascent of a quality function. We consider anumber of quality functions that are rational functions of higher order moments of the extracted feature values. We show that one system learns the principle components of the correlation matrix.Principal component analysis systems are usually not optimal feature extractors for classification. Therefore we design quality functions which produce feature vectors that support unsupervised classification.The properties of the different systems are compared with the help of different artificially designed datasets and a database consisting of all Munsell color spectra. 1 Introduction There are a number of unsupervised Hebbian learning algorithms (see Oja, 1992 and references therein) that perform some version of the Karhunen-Loeve expansion.