Country
Gaussian Process Regression with Mismatched Models
I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning withexcessively strong smoothness assumptions can be particularly dangerous:For example, a student with a standard radial basis function covariance function will learn a rougher teacher function onlylogarithmically slowly. All predictions are confirmed by simulations. 1 Introduction There has in the last few years been a good deal of excitement about the use of Gaussian processes (GPs) as an alternative to feedforward networks [1]. GPs make prior assumptions about the problem to be learned very transparent, and even though they are nonparametric models, inference-at least in the case of regression considered below-is relatively straightforward. One crucial question for applications is then how'fast' GPs learn, i.e. how many training examples are needed to achieve a certain level of generalization performance.
Probabilistic principles in unsupervised learning of visual structure: human data and a model
Edelman, Shimon, Hiles, Benjamin P., Yang, Hwajin, Intrator, Nathan
To find out how the representations of structured visual objects depend on the co-occurrence statistics of their constituents, we exposed subjects to a set of composite images with tight control exerted over (1) the conditional probabilitiesof the constituent fragments, and (2) the value of Barlow's criterion of "suspicious coincidence" (the ratio of joint probability to the product of marginals). We then compared the part verification response timesfor various probe/target combinations before and after the exposure. For composite probes, the speedup was much larger for targets thatcontained pairs of fragments perfectly predictive of each other, compared to those that did not. This effect was modulated by the significance oftheir co-occurrence as estimated by Barlow's criterion. For lone-fragment probes, the speedup in all conditions was generally lower than for composites. These results shed light on the brain's strategies for unsupervised acquisition of structural information in vision.
Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons
Amari, Shun-ichi, Park, Hyeyoung, Ozeki, Tomoko
Singularities are ubiquitous in the parameter space of hierarchical models such as multilayer perceptrons. At singularities, the Fisher information matrix degenerates, and the Cramer-Rao paradigm does no more hold, implying that the classical model selection theory suchas AIC and MDL cannot be applied. It is important to study the relation between the generalization error and the training error at singularities. The present paper demonstrates a method of analyzing these errors both for the maximum likelihood estimator andthe Bayesian predictive distribution in terms of Gaussian random fields, by using simple models. 1 Introduction A neural network is specified by a number of parameters which are synaptic weights and biases. Learning takes place by modifying these parameters from observed input-output examples.
Dynamic Time-Alignment Kernel in Support Vector Machine
Shimodaira, Hiroshi, Noma, Ken-ichi, Nakai, Mitsuru, Sagayama, Shigeki
A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating an idea of nonlinear time alignment into the kernel function. Since the time-alignment operation of sequential pattern is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments of hand-segmented phoneme recognition. Preliminary experimental results show comparable recognition performance with hidden Markov models (HMMs).
Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
Recurrent neural networks of analog units are computers for realvalued functions. We study the time complexity of real computation in general recurrent neural networks. These have sigmoidal, linear, and product units of unlimited order as nodes and no restrictions on the weights. For networks operating in discrete time, we exhibit a family of functions with arbitrarily high complexity, and we derive almost tight bounds on the time required to compute these functions. Thus, evidence is given of the computational limitations that time-bounded analog recurrent neural networks are subject to. 1 Introduction Analog recurrent neural networks are known to have computational capabilities that exceed those of classical Turing machines (see, e.g., Siegelmann and Sontag, 1995; Kilian and Siegelmann, 1996; Siegelmann, 1999).
A Model of the Phonological Loop: Generalization and Binding
O', Reilly, Randall C., Soto, R.
We present a neural network model that shows how the prefrontal cortex, interacting with the basal ganglia, can maintain a sequence of phonological information in activation-based working memory (i.e., the phonological loop). The primary function of this phonological may be to transiently encode arbitrary bindings ofloop information necessary for tasks - the combinatorial expressive power of language enables very flexible binding of essentially arbitrary pieces of information. Our model takes advantage of the closed-class nature of phonemes, which allows different neural representations of all possible phonemes at each sequential position to be encoded. To make this work, we suggest that the basal ganglia update signal that allocates phonemes toprovide a region-specific the appropriate sequential coding slot. To demonstrate that flexible, arbitrary binding of novel sequences can be supported by this we show that the model can generalize to novel sequencesmechanism, after moderate amounts of training.
A Parallel Mixture of SVMs for Very Large Scale Problems
Collobert, Ronan, Bengio, Samy, Bengio, Yoshua
Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithmwhich is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database, yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples) . In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1 Introduction Recently a lot of work has been done around Support Vector Machines [9], mainly due to their impressive generalization performances on classification problems when compared to other algorithms such as artificial neural networks [3, 6].
Intransitive Likelihood-Ratio Classifiers
Bilmes, Jeff, Ji, Gang, Meila, Marina
In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference betweenthe true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improvesthe classification results when tested on medium vocabulary speechrecognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament.Lastly, we find that intransitivity appears to be a good measure of classification confidence.