Statistical Learning
Combining Neural Network Regression Estimates with Regularized Linear Weights
Merz, Christopher J., Pazzani, Michael J.
When combining a set of learned models to form an improved estimator, the issue of redundancy or multicollinearity in the set of models must be addressed. A progression of existing approaches and their limitations with respect to redundancy is discussed. A new approach, PCR*, based on principal components regression is proposed to address these limitations. An evaluation of the new approach on a collection of domains reveals that: 1) PCR* was the most robust combination method as the redundancy of the learned models increased, 2) redundancy could be handled without eliminating any of the learned models, and 3) the principal components of the learned models provided a continuum of "regularized" weights from which PCR* could choose.
Unsupervised Learning by Convex and Conic Coding
Lee, Daniel D., Seung, H. Sebastian
Unsupervised learning algorithms based on convex and conic encoders are proposed. The encoders find the closest convex or conic combination of basis vectors to the input. The learning algorithms produce basis vectors that minimize the reconstruction error of the encoders. The convex algorithm develops locally linear models of the input, while the conic algorithm discovers features. Both algorithms are used to model handwritten digits and compared with vector quantization and principal component analysis.
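The conic encoding step described above (find the closest nonnegative combination of basis vectors to an input) can be sketched as a small nonnegative least-squares problem. The sketch below uses generic projected gradient descent, which is an assumption made for illustration and not necessarily the authors' procedure; the names `conic_encode` and `reconstruct` are hypothetical.

```python
def reconstruct(coeffs, basis):
    """Return sum_j coeffs[j] * basis[j] as a plain list."""
    dim = len(basis[0])
    return [sum(h * b[i] for h, b in zip(coeffs, basis)) for i in range(dim)]


def conic_encode(x, basis, n_iter=2000):
    """Approximate argmin_h ||x - sum_j h[j] * basis[j]||^2 subject to h >= 0.

    Generic projected gradient descent (an illustrative assumption, not the
    paper's algorithm). `basis` is a list of vectors with the length of `x`.
    """
    # The squared Frobenius norm of the basis upper-bounds the Lipschitz
    # constant of the gradient, so this step size is safe.
    step = 1.0 / sum(v * v for b in basis for v in b)
    h = [0.0] * len(basis)
    for _ in range(n_iter):
        residual = [r - xi for r, xi in zip(reconstruct(h, basis), x)]
        grad = [sum(ri * bi for ri, bi in zip(residual, b)) for b in basis]
        # Gradient step followed by projection onto the nonnegative orthant.
        h = [max(0.0, hj - step * gj) for hj, gj in zip(h, grad)]
    return h
```

A convex encoder would additionally constrain the coefficients to sum to one, which amounts to replacing the clipping step with a projection onto the probability simplex.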
Combinations of Weak Classifiers
To obtain classification systems with both good generalization performance and efficiency in space and time, we propose a learning method based on combinations of weak classifiers, where weak classifiers are linear classifiers (perceptrons) which can do a little better than making random guesses. A randomized algorithm is proposed to find the weak classifiers. They are then combined through a majority vote. As demonstrated through systematic experiments, the method developed is able to obtain combinations of weak classifiers with good generalization performance and a fast training time on a variety of test problems and real applications.
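As a rough illustration of the idea above, the sketch below draws random linear classifiers, keeps those that beat chance on the training set, and combines them by majority vote. The acceptance threshold `min_accuracy`, the Gaussian sampling of weights, and all function names are assumptions for illustration; the abstract does not specify the randomized search.

```python
import random


def predict_linear(w, b, x):
    """Perceptron output in {-1, +1}."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def accuracy(w, b, X, y):
    """Fraction of training points that (w, b) classifies correctly."""
    return sum(predict_linear(w, b, x) == t for x, t in zip(X, y)) / len(X)


def find_weak_classifiers(X, y, n_classifiers, min_accuracy=0.55,
                          seed=0, max_tries=100000):
    """Sample random hyperplanes; keep those slightly better than chance."""
    rng = random.Random(seed)
    dim = len(X[0])
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == n_classifiers:
            break
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = rng.gauss(0.0, 1.0)
        if accuracy(w, b, X, y) >= min_accuracy:
            chosen.append((w, b))
    return chosen


def majority_vote(classifiers, x):
    """Combine weak classifiers by an unweighted vote (ties go to +1)."""
    votes = sum(predict_linear(w, b, x) for w, b in classifiers)
    return 1 if votes >= 0 else -1
```

Using an odd number of classifiers avoids ties in the vote.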
One-unit Learning Rules for Independent Component Analysis
Neural one-unit learning rules for the problem of Independent Component Analysis (ICA) and blind source separation are introduced. In these new algorithms, every ICA neuron develops into a separator that finds one of the independent components. The learning rules use very simple constrained Hebbian/anti-Hebbian learning in which decorrelating feedback may be added. To speed up the convergence of these stochastic gradient descent rules, a novel computationally efficient fixed-point algorithm is introduced. Independent Component Analysis (ICA) (Comon, 1994; Jutten and Herault, 1991) is a signal processing technique whose goal is to express a set of random variables as linear combinations of statistically independent component variables. The main applications of ICA are in blind source separation, feature extraction, and blind deconvolution.
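The abstract does not spell out the fixed-point rule, so the sketch below assumes the well-known kurtosis-based one-unit update for whitened data, w ← E[x (wᵀx)³] − 3w followed by normalization. The helper `make_rotated_uniform_sources` is hypothetical and generates toy data that is already approximately white (unit-variance uniform sources mixed by a rotation), so no whitening step is shown.

```python
import math
import random


def make_rotated_uniform_sources(n, angle, seed=0):
    """Two independent unit-variance uniform sources mixed by a 2-D rotation."""
    rng = random.Random(seed)
    c, s = math.cos(angle), math.sin(angle)
    half_width = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has variance 1
    data = []
    for _ in range(n):
        s1 = rng.uniform(-half_width, half_width)
        s2 = rng.uniform(-half_width, half_width)
        data.append([c * s1 - s * s2, s * s1 + c * s2])
    return data


def one_unit_fixed_point(X, n_iter=30, seed=0):
    """Kurtosis-based one-unit fixed-point iteration (assumes X is whitened)."""
    rng = random.Random(seed)
    dim = len(X[0])
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    for _ in range(n_iter):
        acc = [0.0] * dim
        for x in X:
            y3 = sum(wi * xi for wi, xi in zip(w, x)) ** 3
            for i in range(dim):
                acc[i] += x[i] * y3
        # Sample estimate of E[x (w^T x)^3] minus 3w, then renormalize.
        new_w = [a / len(X) - 3.0 * wi for a, wi in zip(acc, w)]
        norm = math.sqrt(sum(v * v for v in new_w))
        w = [v / norm for v in new_w]
    return w
```

The returned vector is determined only up to sign, and for sources with negative kurtosis the sign may alternate between iterations, so comparisons with the true mixing directions should use the absolute value of the dot product.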
Adaptively Growing Hierarchical Mixtures of Experts
Fritsch, Jürgen, Finke, Michael, Waibel, Alex
We propose a novel approach to automatically growing and pruning Hierarchical Mixtures of Experts. The constructive algorithm proposed here enables large hierarchies consisting of several hundred experts to be trained effectively. We show that HMEs trained by our automatic growing procedure yield better generalization performance than traditional static and balanced hierarchies. Evaluation of the algorithm is performed (1) on vowel classification and (2) within a hybrid version of the JANUS [9] speech recognition system using a subset of the Switchboard large-vocabulary speaker-independent continuous speech recognition database.
Limitations of Self-organizing Maps for Vector Quantization and Multidimensional Scaling
SOM can be said to do clustering/vector quantization (VQ) and at the same time to preserve the spatial ordering of the input data reflected by an ordering of the code book vectors (cluster centroids) in a one or two dimensional output space, where the latter property is closely related to multidimensional scaling (MDS) in statistics. Although the level of activity and research around the SOM algorithm is quite large (a recent overview by [Kohonen 95] contains more than 1000 citations), little comparison is available among the numerous existing variants of the basic approach, or with more traditional statistical techniques from the larger frameworks of VQ and MDS. Additionally, there is little advice in the literature about how to properly use SOM in order to get optimal results in terms of either vector quantization (VQ) or multidimensional scaling, or maybe even both of them. To make the notion of SOM being a tool for "data visualization" more precise, the following question has to be answered: Should SOM be used for doing VQ, MDS, both at the same time, or neither? Two recent comprehensive studies comparing SOM either to traditional VQ or MDS techniques separately seem to indicate that SOM is not competitive when used for either VQ or MDS: [Balakrishnan et al. 94] compare SOM to K-means clustering on 108 multivariate normal clustering problems with known clustering solutions and show that SOM performs significantly worse in terms of data points misclassified
On a Modification to the Mean Field EM Algorithm in Factorial Learning
Dunmur, A. P., Titterington, D. M.
A modification is described to the use of mean field approximations in the E step of EM algorithms for analysing data from latent structure models, as described by Ghahramani (1995), among others. The modification involves second-order Taylor approximations to expectations computed in the E step. The potential benefits of the method are illustrated using very simple latent profile models.
Estimating Equivalent Kernels for Neural Networks: A Data Perturbation Approach
The perturbation method which we have presented overcomes the limitations of standard approaches, which are only appropriate for models with a single layer of adjustable weights, albeit at considerable computational expense. It has the added bonus of automatically taking into account the effect of regularisation techniques such as weight decay. The experimental results illustrate the application of the technique to two simple problems. As expected, the number of degrees of freedom in the models is found to be related to the amount of weight decay used during training. The equivalent kernels are found to vary significantly in different regions of input space and the functions reconstructed from the estimated smoother matrices closely match the original
Improving the Accuracy and Speed of Support Vector Machines
Burges, Christopher J. C., Schölkopf, Bernhard
Support Vector Learning Machines (SVM) are finding application in pattern recognition, regression estimation, and operator inversion for ill-posed problems. Against this very general backdrop, any methods for improving the generalization performance, or for improving the speed in test phase, of SVMs are of increasing interest. In this paper we combine two such techniques on a pattern recognition problem. The method for improving generalization performance (the "virtual support vector" method) does so by incorporating known invariances of the problem. This method achieves a drop in the error rate on 10,000 NIST test digit images from 1.4% to 1.0%.
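One way to picture "incorporating known invariances" is to generate transformed copies of the support vectors and retrain on them. The sketch below covers only that generation step and assumes the invariance is a one-pixel image translation; the abstract does not name the transformations used. The functions `shift_image` and `virtual_examples` are hypothetical helpers, and no SVM training is shown.

```python
def shift_image(image, dx, dy, fill=0):
    """Translate a 2-D image (list of rows) by (dx, dy) pixels, padding with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            nr, nc = r + dy, c + dx
            # Pixels pushed outside the frame are discarded.
            if 0 <= nr < height and 0 <= nc < width:
                shifted[nr][nc] = image[r][c]
    return shifted


def virtual_examples(support_vectors,
                     shifts=((1, 0), (-1, 0), (0, 1), (0, -1))):
    """Create label-preserving translated copies of (image, label) pairs."""
    out = []
    for image, label in support_vectors:
        for dx, dy in shifts:
            out.append((shift_image(image, dx, dy), label))
    return out
```

Under this assumption, a second classifier would be trained on the translated copies, optionally together with the original support vectors.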