Sparsity of Data Representation of Optimal Kernel Machine and Leave-one-out Estimator

Neural Information Processing Systems

Vapnik's result that the expectation of the generalisation error of the optimal hyperplane is bounded by the expectation of the ratio of the number of support vectors to the number of training examples is extended to a broad class of kernel machines. The class includes Support Vector Machines for soft margin classification and regression, and Regularization Networks with a variety of kernels and cost functions. We show that key inequalities in Vapnik's result become equalities once "the classification error" is replaced by "the margin error", with the latter defined as an instance with positive cost. In particular we show that the expectations of the true margin error and the empirical margin error are equal, and that sparse solutions for kernel machines are possible only if the cost function is "partially" insensitive.

1 Introduction

Minimization of regularized risk is a backbone of several recent advances in machine learning, including Support Vector Machines (SVM) [13], Regularization Networks (RN) [5] and Gaussian Processes [15]. Such a machine is typically implemented as a weighted sum of a kernel function evaluated on pairs composed of the data vector in question and a number of selected training vectors, the so-called support vectors.
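The bound referred to above relates the expected generalisation (leave-one-out) error to the expected fraction of support vectors. A minimal sketch of that relationship on synthetic data, assuming scikit-learn (the dataset, kernel, and hyperparameters are illustrative, not the paper's experiments):

# Sketch: compare the support-vector fraction of a soft-margin SVM to its
# leave-one-out error, illustrating the bound E[LOO error] <= E[#SV / n].
# Assumes scikit-learn; dataset and hyperparameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
sv_ratio = len(clf.support_) / len(X)        # number of support vectors / n

loo_acc = cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=LeaveOneOut()).mean()
print(f"support-vector ratio: {sv_ratio:.3f}  leave-one-out error: {1 - loo_acc:.3f}")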


Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

Neural Information Processing Systems

Two well-known classes of unsupervised procedures that can be cast in this manner are generative and recoding models. In a generative unsupervised framework, the environment generates training examples, which we will refer to as observations, by sampling from one distribution; the other distribution is embodied in the model. Examples of generative frameworks are mixtures of Gaussians (MoG) [2], factor analysis [4], and Boltzmann machines [8]. In the recoding unsupervised framework, the model transforms points from an observation space to an output space, and the output distribution is compared either to a reference distribution or to a distribution derived from the output distribution. An example is independent component analysis (ICA) [11], a method that discovers a representation of vector-valued observations in which the statistical dependence among the vector elements in the output space is minimized.
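A minimal sketch of the two framings on toy data, assuming scikit-learn: a mixture of Gaussians as the generative model and FastICA as the recoding model (data and settings are illustrative only):

# Sketch: the two unsupervised framings on toy data.
# Generative: a mixture of Gaussians models the observation distribution itself.
# Recoding:   FastICA maps observations to outputs that are made as
#             statistically independent as possible.
# Assumes scikit-learn; data and settings are illustrative.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
S = rng.laplace(size=(1000, 2))              # independent non-Gaussian sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])       # mixing matrix
X = S @ A.T                                  # observed mixtures

gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
print("MoG average log-likelihood:", gmm.score(X))       # generative fit

Y = FastICA(n_components=2, random_state=0).fit_transform(X)
print("correlation of recoded outputs:\n", np.corrcoef(Y.T))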


Natural Sound Statistics and Divisive Normalization in the Auditory System

Neural Information Processing Systems

We explore the statistical properties of natural sound stimuli preprocessed with a bank of linear filters. The responses of such filters exhibit a striking form of statistical dependency, in which the response variance of each filter grows with the response amplitude of filters tuned for nearby frequencies. These dependencies may be substantially reduced using an operation known as divisive normalization, in which the response of each filter is divided by a weighted sum of the rectified responses of other filters. The weights may be chosen to maximize the independence of the normalized responses for an ensemble of natural sounds. We demonstrate that the resulting model accounts for nonlinearities in the response characteristics of the auditory nerve, by comparing model simulations to electrophysiological recordings.
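A minimal sketch of the normalization operation itself, with fixed illustrative weights rather than weights optimized for independence over natural sounds as in the paper (here squaring plays the role of rectification):

# Sketch: divisive normalization of a bank of linear filter responses.
# Each squared response is divided by a weighted sum of the squared responses
# of the other filters plus a constant.  The weights below are fixed
# placeholders; the paper chooses them to maximize independence of the
# normalized responses over an ensemble of natural sounds.
import numpy as np

def divisive_normalization(responses, weights, sigma=1.0):
    """responses: (n_filters, n_samples) linear filter outputs."""
    energy = responses ** 2
    pool = weights @ energy + sigma ** 2     # weighted sum over the other filters
    return energy / pool

rng = np.random.default_rng(0)
r = rng.normal(size=(8, 1000))               # toy filter responses
w = np.full((8, 8), 0.1)
np.fill_diagonal(w, 0.0)                     # each filter is normalized by the others
print(divisive_normalization(r, w).shape)    # (8, 1000) normalized responses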


Some New Bounds on the Generalization Error of Combined Classifiers

Neural Information Processing Systems

In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution, which was introduced in the recent papers of Bartlett and of Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove margin-type inequalities for very general function classes, the complexity of a class being measured via so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of the l1-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee.
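Schematically, margin-type bounds of this kind take the following shape (a sketch only, with constants, confidence terms and lower-order factors suppressed; not the paper's exact statement): with high probability, for every f in the class \mathcal{F} and every margin \delta > 0,

\Pr\{\, y f(x) \le 0 \,\} \;\le\; \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{\, y_i f(x_i) \le \delta \,\} \;+\; \frac{C}{\delta}\, G_n(\mathcal{F}) \;+\; \text{(confidence term)},

where G_n(\mathcal{F}) denotes the Gaussian complexity of \mathcal{F}: the empirical margin error at scale \delta replaces the classification error, and the complexity penalty scales inversely with the margin.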


Automatic Choice of Dimensionality for PCA

Neural Information Processing Systems

A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Stiefel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.
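As an illustration only (not the paper's own code): scikit-learn's PCA exposes an automatic dimensionality choice, n_components='mle', based on this style of evidence approximation, and a toy run recovers the latent dimensionality:

# Sketch: automatic choice of PCA dimensionality on toy data.  scikit-learn's
# PCA(n_components="mle") implements an evidence-based choice in this spirit;
# shown as an illustration, not as the paper's own implementation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))                  # true dimensionality: 5
W = rng.normal(size=(5, 20))
X = latent @ W + 0.1 * rng.normal(size=(500, 20))   # 20-d noisy observations

pca = PCA(n_components="mle").fit(X)
print("estimated dimensionality:", pca.n_components_)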


Structure Learning in Human Causal Induction

Neural Information Processing Systems

We use graphical models to explore the question of how people learn simple causal relationships from data. The two leading psychological theories can both be seen as estimating the parameters of a fixed graph. We argue that a complete account of causal induction should also consider how people learn the underlying causal graph structure, and we propose to model this inductive process as Bayesian inference. Our argument is supported through the discussion of three data sets.

1 Introduction

Causality plays a central role in human mental life. Our behavior depends upon our understanding of the causal structure of our environment, and we are remarkably good at inferring causation from mere observation. Constructing formal models of causal induction is currently a major focus of attention in computer science [7], psychology [3,6], and philosophy [5]. This paper attempts to connect these literatures by framing the debate between two major psychological theories in the computational language of graphical models. We show that existing theories equate human causal induction with maximum likelihood parameter estimation on a fixed graphical structure, and we argue that to fully account for human behavioral data, we must also postulate that people make Bayesian inferences about the underlying causal graph structure itself.
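A minimal sketch of the contrast between structure inference and parameter estimation, on a single 2x2 contingency experiment: two candidate graphs are compared by their marginal likelihoods. The uniform priors, noisy-OR parameterization, and grid integration are illustrative assumptions, not necessarily the paper's exact choices.

# Sketch: Bayesian comparison of two causal graph structures for a single 2x2
# contingency experiment (candidate cause C present/absent vs. effect E
# present/absent).  Graph 0: only a background cause influences E.  Graph 1:
# the background cause and C both influence E (combined with a noisy-OR).
import numpy as np

w = np.linspace(1e-3, 1 - 1e-3, 200)                 # grid over causal strengths

def log_marginal_graph0(e_c1, n_c1, e_c0, n_c0):
    # P(E) = w0 whether or not C is present; integrate the likelihood over w0.
    lik = w ** (e_c1 + e_c0) * (1 - w) ** (n_c1 - e_c1 + n_c0 - e_c0)
    return np.log(np.trapz(lik, w))

def log_marginal_graph1(e_c1, n_c1, e_c0, n_c0):
    # Noisy-OR: P(E | C=1) = w0 + w1 - w0*w1, P(E | C=0) = w0.
    W0, W1 = np.meshgrid(w, w, indexing="ij")
    p1 = W0 + W1 - W0 * W1
    lik = (p1 ** e_c1 * (1 - p1) ** (n_c1 - e_c1)
           * W0 ** e_c0 * (1 - W0) ** (n_c0 - e_c0))
    return np.log(np.trapz(np.trapz(lik, w, axis=1), w))

# 6 of 8 trials show the effect with the cause present, 2 of 8 without it.
support = log_marginal_graph1(6, 8, 2, 8) - log_marginal_graph0(6, 8, 2, 8)
print("log Bayes factor in favour of a C -> E link:", round(support, 2))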


Kernel Expansions with Unlabeled Examples

Neural Information Processing Systems

Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is discriminative and attains the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
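A minimal sketch of the expansion step only, assuming scikit-learn: each input is represented by its kernel values against both labeled and unlabeled points, and a linear classifier is fit on the few labeled examples. This illustrates the representation, not the paper's EM-based or maximum-entropy estimation; dataset and kernel width are illustrative.

# Sketch of the kernel expansion: represent every input by its kernel values
# against all (labeled and unlabeled) training points, then fit a linear
# classifier on the few labeled examples.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
labeled = np.concatenate([np.where(y == 0)[0][:5], np.where(y == 1)[0][:5]])

Phi = rbf_kernel(X, X, gamma=2.0)            # expanded feature vectors over all points
clf = LogisticRegression(max_iter=1000).fit(Phi[labeled], y[labeled])
print("accuracy on all 300 points:", clf.score(Phi, y))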


Competition and Arbors in Ocular Dominance

Neural Information Processing Systems

Hebbian and competitive Hebbian algorithms are almost ubiquitous in modeling pattern formation in cortical development. We analyse in theoretical detail a particular model (adapted from Piepenbrock & Obermayer, 1999) for the development of 1d stripe-like patterns, which places competitive and interactive cortical influences, and free and restricted initial arborisation, on a common footing.

1 Introduction

Cats, many species of monkeys, and humans exhibit ocular dominance stripes, which are alternating areas of primary visual cortex devoted to input from (the thalamic relay associated with) just one or the other eye (see Erwin et al., 1995; Miller, 1996; Swindale, 1996 for reviews of theory and data). These well-known fingerprint patterns have been a seductive target for models of cortical pattern formation because of the mix of competition and cooperation they suggest. A wealth of synaptic adaptation algorithms has been suggested to account for them (and also the concomitant refinement of the topography of the map between the eyes and the cortex), many of which are based on forms of Hebbian learning. Critical issues for the models are the degree of correlation between inputs from the eyes, the nature of the initial arborisation of the axonal inputs, the degree and form of cortical competition, and the nature of synaptic saturation (preventing weights from changing sign or getting too large) and normalisation (allowing cortical and/or thalamic cells to support only a certain total synaptic weight).
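A toy correlation-based sketch in the spirit of such Hebbian models (a generic illustration, not the specific model analysed in the paper): the left-minus-right drive to each cortical unit grows under a Mexican-hat cortical interaction and saturates, yielding alternating stripes of eye dominance.

# Toy 1d ocular dominance sketch.  w_diff[i] is the left-minus-right drive to
# cortical unit i; a Mexican-hat interaction amplifies a preferred spatial
# frequency, so the sign of w_diff comes to alternate in stripes once growth
# saturates.  All parameters are illustrative.
import numpy as np

n = 100
x = np.arange(n)
dist = np.abs(x[:, None] - x[None, :])
dist = np.minimum(dist, n - dist)                              # circular distance
K = np.exp(-dist**2 / (2 * 3.0**2)) - 0.5 * np.exp(-dist**2 / (2 * 9.0**2))

rng = np.random.default_rng(0)
w_diff = 0.01 * rng.normal(size=n)                             # small random bias
for _ in range(200):
    w_diff = np.clip(w_diff + 0.01 * K @ w_diff, -1.0, 1.0)    # growth + saturation

print("".join("L" if v > 0 else "R" for v in w_diff))          # stripe pattern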


Feature Correspondence: A Markov Chain Monte Carlo Approach

Neural Information Processing Systems

When trying to recover 3D structure from a set of images, the most difficult problem is establishing the correspondence between the measurements. Most existing approaches assume that features can be tracked across frames, whereas methods that exploit rigidity constraints to facilitate matching do so only under restricted camera motion. In this paper we propose a Bayesian approach that avoids the brittleness associated with singling out one "best" correspondence, and instead considers the distribution over all possible correspondences. We treat both a fully Bayesian approach that yields a posterior distribution, and a MAP approach that makes use of EM to maximize this posterior. We show how Markov chain Monte Carlo methods can be used to implement these techniques in practice, and present experimental results on real data.
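A minimal sketch of sampling over correspondences: the state is a permutation matching features between two point sets, and a Metropolis step proposes swapping two assignments. The isotropic Gaussian likelihood below is a placeholder assumption, not the structure-from-motion likelihood used in the paper.

# Sketch: Metropolis sampling over correspondences between two point sets.
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 15, 0.05
pts1 = rng.uniform(size=(n, 2))
true_perm = rng.permutation(n)
pts2 = pts1[true_perm] + sigma * rng.normal(size=(n, 2))   # matched, noisy, shuffled

def log_lik(perm):
    return -np.sum((pts2 - pts1[perm]) ** 2) / (2 * sigma**2)

perm = np.arange(n)
best = perm.copy()
for _ in range(20000):
    i, j = rng.integers(n, size=2)
    prop = perm.copy()
    prop[i], prop[j] = prop[j], prop[i]                    # swap two assignments
    if np.log(rng.random()) < log_lik(prop) - log_lik(perm):
        perm = prop
        if log_lik(perm) > log_lik(best):
            best = perm.copy()

print("fraction of correct assignments:", np.mean(best == true_perm))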


Minimum Bayes Error Feature Selection for Continuous Speech Recognition

Neural Information Processing Systems

Two avenues will be explored: the first is to maximize the θ-average divergence between the class densities and the second is to minimize the union Bhattacharyya bound in the range of θ, where θ denotes the linear feature-space projection being designed. While both approaches yield similar performance in practice, they outperform standard LDA features and show a 10% relative improvement in the word error rate over state-of-the-art cepstral features on a large vocabulary telephony speech recognition task.

1 Introduction

Modern speech recognition systems use cepstral features characterizing the short-term spectrum of the speech signal for classifying frames into phonetic classes. These features are augmented with dynamic information from the adjacent frames to capture transient spectral events in the signal.
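A minimal two-class sketch of the Bhattacharyya-bound criterion (the paper works with many Gaussian classes, a union bound, and a θ-averaged divergence on speech features; the random search over 1d projections below is purely illustrative):

# Two-class sketch: among random 1d projections, pick the one minimizing the
# Bhattacharyya upper bound on the Bayes error between the projected Gaussian
# class densities.  Data and search procedure are illustrative.
import numpy as np

def bhattacharyya_bound(x0, x1, p0=0.5, p1=0.5):
    m0, m1, v0, v1 = x0.mean(), x1.mean(), x0.var(), x1.var()
    db = (0.25 * (m0 - m1) ** 2 / (v0 + v1)
          + 0.5 * np.log((v0 + v1) / (2 * np.sqrt(v0 * v1))))
    return np.sqrt(p0 * p1) * np.exp(-db)      # upper bound on the Bayes error

rng = np.random.default_rng(0)
X0 = rng.normal([0.0, 0.0], [1.0, 3.0], size=(500, 2))    # class 0
X1 = rng.normal([1.5, 0.0], [1.0, 3.0], size=(500, 2))    # class 1

directions = [rng.normal(size=2) for _ in range(200)]
bounds = [bhattacharyya_bound(X0 @ d, X1 @ d) for d in directions]
k = int(np.argmin(bounds))
print("minimum Bhattacharyya bound:", round(bounds[k], 3))
print("best direction (normalized):", directions[k] / np.linalg.norm(directions[k]))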