An Adaptive Metric Machine for Pattern Classification

Neural Information Processing Systems

Nearest neighbor classification assumes locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with finite samples due to the curse of dimensionality, and severe bias can be introduced under these conditions when the nearest neighbor rule is used. We propose a locally adaptive nearest neighbor classification method that aims to minimize this bias. We use a Chi-squared distance analysis to compute a flexible metric that produces neighborhoods elongated along less relevant feature dimensions and constricted along the most influential ones. As a result, the class conditional probabilities tend to be smoother in the modified neighborhoods, and better classification performance can be achieved. The efficacy of our method is validated and compared against other techniques on a variety of real-world data.
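
For intuition, the following is a minimal Python sketch of the flexible-metric idea, with names and a crude local relevance estimate of our own standing in for the paper's exact Chi-squared analysis: per-feature relevance is measured by how much the class distribution shifts when a feature is held close to the query's value, and the final vote uses the resulting weighted distance.

import numpy as np

def adaptive_knn_predict(X, y, query, k=5, k0=50, eps=1e-12):
    # Hypothetical sketch of a locally adaptive k-NN rule.
    # 1. Initial neighborhood under the plain Euclidean metric.
    d0 = np.linalg.norm(X - query, axis=1)
    nbr = np.argsort(d0)[:k0]
    Xn, yn = X[nbr], y[nbr]
    classes = np.unique(y)
    p_global = np.array([(yn == c).mean() for c in classes])
    # 2. Per-feature relevance: how strongly does conditioning on a feature
    #    value near the query's change the local class distribution?
    relevance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        diff = np.abs(Xn[:, j] - query[j])
        close = diff <= np.median(diff)
        p_local = np.array([(yn[close] == c).mean() for c in classes])
        relevance[j] = np.sum((p_local - p_global) ** 2 / (p_global + eps))
    # 3. Weighted metric: influential features get large weights (neighborhood
    #    constricted along them), irrelevant ones get small weights (elongated).
    w = relevance / (relevance.sum() + eps)
    # 4. Final k-NN vote under the weighted distance.
    d = np.sqrt(((X - query) ** 2 * w).sum(axis=1))
    votes = y[np.argsort(d)[:k]]
    vals, counts = np.unique(votes, return_counts=True)
    return vals[np.argmax(counts)]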


Model Complexity, Goodness of Fit and Diminishing Returns

Neural Information Processing Systems

Such learning tasks can typically be characterized by the existence of a model and a loss function. A fitted model of complexity k is a function of the data points D and depends on a specific set of fitted parameters B. The loss function (goodness-of-fit) is a functional of the model and maps each specific model to a scalar used to evaluate the model, e.g., likelihood for density estimation or sum-of-squares for regression. Figure 1 illustrates a typical empirical curve of loss function versus complexity, for mixtures of Markov models fitted to a large data set of 900,000 sequences. The complexity k is the number of Markov models in the mixture (see Cadez et al. (2000) for further details on the model and the data set). The empirical curve has a distinctly concave appearance, with large relative gains in fit for low-complexity models and much more modest relative gains for high-complexity models.
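
The diminishing-returns shape is easy to reproduce on toy data. The sketch below swaps the paper's mixtures of Markov models for one-dimensional Gaussian mixtures fitted by EM (a deliberate simplification, not the paper's setup), purely to show the fitted log-likelihood rising steeply at small k and flattening at large k.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the paper's sequence data: a sample from three
# overlapping Gaussians.
data = np.concatenate([rng.normal(m, 1.0, 3000) for m in (-4.0, 0.0, 5.0)])

def fit_gmm_loglik(x, k, n_iter=200):
    # Fit a k-component 1-D Gaussian mixture with EM; return the final
    # log-likelihood (the goodness-of-fit at complexity k).
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point.
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(w))
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        r = np.exp(log_p - log_norm[:, None])
        # M-step: re-estimate mixing weights, means and variances.
        nk = r.sum(axis=0) + 1e-10
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return log_norm.sum()

# Large relative gains in fit at small k, much smaller gains at large k.
for k in (1, 2, 3, 5, 8, 12):
    print(k, round(fit_gmm_loglik(data, k)))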


A Tighter Bound for Graphical Models

Neural Information Processing Systems

The neurons in these networks are the random variables, whereas the connections between them model the causal dependencies. Usually, some of the nodes have a direct relation with the random variables in the problem and are called 'visibles'. The other nodes, known as 'hiddens', are used to model more complex probability distributions. Learning in graphical models can be done as long as the likelihood that the visibles correspond to a pattern in the data set can be computed. In general, the time this takes scales exponentially with the number of hidden neurons.
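
A tiny example makes the exponential cost concrete. The sketch below is our own construction (a small sigmoid belief network, not the paper's model): computing the exact likelihood of a visible pattern requires summing over all 2^H hidden configurations, so the running time grows exponentially with the number of hidden units.

import numpy as np
from itertools import product

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def visible_likelihood(v, W, b, prior=0.5):
    # Exact p(v) in a tiny sigmoid belief net: hidden units are independent
    # Bernoulli(prior) causes, and visible unit i is on with probability
    # sigmoid((W h + b)_i). The loop below is the exponential-cost sum.
    n_hidden = W.shape[1]
    total = 0.0
    for bits in product([0, 1], repeat=n_hidden):
        h = np.array(bits, dtype=float)
        p_h = prior ** h.sum() * (1 - prior) ** (n_hidden - h.sum())
        p_on = sigmoid(W @ h + b)
        total += p_h * np.prod(np.where(v == 1, p_on, 1 - p_on))
    return total

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))          # 6 visibles, 4 hiddens: 2**4 = 16 terms
b = rng.normal(size=6)
v = np.array([1, 0, 1, 1, 0, 0])
print(visible_likelihood(v, W, b))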


Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

Neural Information Processing Systems

People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a low level, faces severe challenges, including the lack of accurate statistical models for the signals, their high dimensionality, and their varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a nonparametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation.
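
As a rough illustration of that pipeline (with the projection and density estimator chosen by us, not taken from the paper), the sketch below uses the first canonical correlation pair of toy 'audio' and 'video' features as a stand-in for the maximally informative subspace, then estimates the joint density of the two projections with a Gaussian kernel density estimate.

import numpy as np

def first_canonical_pair(X, Y, reg=1e-3):
    # First canonical directions for two data matrices (n x dx, n x dy).
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxx = Xc.T @ Xc / len(X) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / len(X)
    Ex, Vx = np.linalg.eigh(Cxx)
    Ey, Vy = np.linalg.eigh(Cyy)
    Wx = Vx @ np.diag(Ex ** -0.5) @ Vx.T     # whitening transforms
    Wy = Vy @ np.diag(Ey ** -0.5) @ Vy.T
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return Wx @ U[:, 0], Wy @ Vt[0]

def kde_joint(a, b, grid_a, grid_b, bw=0.3):
    # Nonparametric (Gaussian-kernel) joint density of two 1-D signals on a grid.
    A, B = np.meshgrid(grid_a, grid_b, indexing="ij")
    k = np.exp(-((A[..., None] - a) ** 2 + (B[..., None] - b) ** 2) / (2 * bw ** 2))
    return k.sum(-1) / (len(a) * 2 * np.pi * bw ** 2)

# Toy 'audio' and 'video' feature streams sharing one latent source z.
rng = np.random.default_rng(0)
z = rng.normal(size=500)
audio = np.outer(z, [1.0, -0.5, 0.2]) + 0.3 * rng.normal(size=(500, 3))
video = np.outer(z, [0.8, 0.1, 0.4, -0.3]) + 0.3 * rng.normal(size=(500, 4))

wa, wv = first_canonical_pair(audio, video)
a, b = (audio - audio.mean(0)) @ wa, (video - video.mean(0)) @ wv
grid = np.linspace(-3, 3, 50)
density = kde_joint(a, b, grid, grid)
print(density.sum() * (grid[1] - grid[0]) ** 2)   # roughly 1: a proper density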


Stagewise Processing in Error-correcting Codes and Image Restoration

Neural Information Processing Systems

We introduce stagewise processing in error-correcting codes and image restoration, by extracting information from the former stage and using it selectively to improve the performance of the latter one. Both mean-field analysis using the cavity method and simulations show that it has the advantage of being robust against uncertainties in hyperparameter estimation. In error-correcting codes [1] and image restoration [2], the choice of the so-called hyperparameters is an important factor in determining their performance. Hyperparameters refer to the coefficients weighing the biases and variances of the tasks. In error correction, they determine the statistical significance given to the parity-checking terms and the received bits. Similarly, in image restoration, they determine the statistical weights given to the prior knowledge and the received data.
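
What "statistical weights given to the prior knowledge and the received data" means in image restoration can be illustrated with a deliberately simple scheme that is not the paper's method: iterated conditional modes on a binary image, where the hyperparameter h weights the received (noisy) pixels and beta weights the smoothness prior.

import numpy as np

def restore_icm(noisy, h=1.0, beta=2.0, n_sweeps=5):
    # Iterated-conditional-modes restoration of a +/-1 image.
    #   h    -- weight of the data term (received pixels),
    #   beta -- weight of the smoothness prior (neighbors should agree).
    x = noisy.copy()
    rows, cols = x.shape
    for _ in range(n_sweeps):
        for i in range(rows):
            for j in range(cols):
                nb = 0.0                      # sum over the 4-neighborhood
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    if 0 <= i + di < rows and 0 <= j + dj < cols:
                        nb += x[i + di, j + dj]
                # Pick the sign that maximizes the local posterior.
                x[i, j] = 1 if beta * nb + h * noisy[i, j] >= 0 else -1
    return x

rng = np.random.default_rng(0)
clean = np.ones((32, 32), dtype=int)
clean[8:24, 8:24] = -1                        # a simple block image
noisy = np.where(rng.random(clean.shape) < 0.15, -clean, clean)
restored = restore_icm(noisy)
print((noisy != clean).sum(), "errors before;", (restored != clean).sum(), "after")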


The Interplay of Symbolic and Subsymbolic Processes in Anagram Problem Solving

Neural Information Processing Systems

Although connectionist models have provided insights into the nature of perception and motor control, connectionist accounts of higher cognition seldom go beyond an implementation of traditional symbol-processing theories. We describe a connectionist constraint satisfaction model of how people solve anagram problems. The model exploits statistics of English orthography, but also addresses the interplay of subsymbolic and symbolic computation via a mechanism that extracts approximate symbolic representations (partial orderings of letters) from subsymbolic structures and injects the extracted representation back into the model to assist in the solution of the anagram. We show the computational benefit of this extraction-injection process and discuss its relationship to conscious mental processes and working memory. We also account for experimental data concerning the difficulty of anagram solution based on the orthographic structure of the anagram string and the target word.
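
The orthographic-statistics ingredient can be illustrated apart from the connectionist machinery. The toy sketch below (the word list, scoring, and all names are ours, purely illustrative) ranks candidate letter orderings of an anagram by letter-bigram statistics, the kind of ordering evidence the model's extraction step makes explicit.

import numpy as np
from itertools import permutations
from collections import Counter

# A tiny stand-in corpus for "statistics of English orthography".
corpus = ["their", "there", "think", "thing", "night", "right", "light",
          "might", "earth", "heart", "water", "after", "other", "than"]
bigrams = Counter()
for word in corpus:                     # bigram counts with boundary markers
    padded = "^" + word + "$"
    bigrams.update(zip(padded, padded[1:]))
total = sum(bigrams.values())

def ordering_score(letters):
    # Log-probability-style score under the bigram statistics, with add-one
    # smoothing so unseen pairs are penalized rather than forbidden.
    padded = "^" + "".join(letters) + "$"
    return sum(np.log((bigrams[p] + 1) / (total + 27 * 27))
               for p in zip(padded, padded[1:]))

def best_orderings(anagram, top=3):
    # Rank all orderings of the anagram's letters; the top-scoring ones play
    # the role of the extracted symbolic candidates.
    candidates = {"".join(p) for p in permutations(anagram)}
    return sorted(candidates, key=ordering_score, reverse=True)[:top]

print(best_orderings("retah"))   # should favor orthographically plausible
                                 # orderings such as 'heart' and 'earth'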


The Kernel Trick for Distances

Neural Information Processing Systems

A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are really distance-based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms. One of the crucial ingredients of SVMs is the so-called kernel trick for the computation of dot products in high-dimensional feature spaces using simple functions defined on pairs of input patterns. This trick allows the formulation of nonlinear variants of any algorithm that can be cast in terms of dot products, SVMs being but the most prominent example [13, 8]. Although the mathematical result underlying the kernel trick is almost a century old [6], it was only much later [1, 3, 13] that it was made fruitful for the machine learning community. Kernel methods have since led to interesting generalizations of learning algorithms and to successful real-world applications. The present paper attempts to extend the utility of the kernel trick by looking at the problem of which kernels can be used to compute distances in feature spaces. Again, the underlying mathematical results, mainly due to Schoenberg, have been known for a while [7]; some of them have already attracted interest in the kernel methods community in various contexts [11, 5, 15].
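
The identity at the heart of this construction is easy to check numerically: for a positive definite kernel k with feature map phi, the squared feature-space distance is ||phi(x) - phi(y)||^2 = k(x,x) - 2 k(x,y) + k(y,y). The snippet below verifies this for a second-degree polynomial kernel, a toy choice of ours.

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)

# Homogeneous 2nd-degree polynomial kernel k(a, b) = (a . b)^2, whose explicit
# feature map is phi(a) = (a1^2, a2^2, sqrt(2) a1 a2).
k = lambda a, b: float(a @ b) ** 2
phi = lambda a: np.array([a[0] ** 2, a[1] ** 2, np.sqrt(2) * a[0] * a[1]])

explicit = np.sum((phi(x) - phi(y)) ** 2)          # distance in feature space
via_kernel = k(x, x) - 2 * k(x, y) + k(y, y)       # same thing, kernel only
print(explicit, via_kernel)                        # equal up to round-off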


The Use of Classifiers in Sequential Inference

Neural Information Processing Systems

We study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. In particular, we develop two general approaches for an important subproblem: identifying phrase structure. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing.
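
A minimal sketch of the Markovian variant, under the assumption that per-token classifiers supply estimates of P(state | observation) which are plugged in as emission scores for Viterbi decoding; the states, transition table, and numbers below are invented for illustration.

import numpy as np

def viterbi_with_classifier(class_probs, trans, prior):
    # class_probs[t, s]: classifier estimate of P(state s | observation at t);
    # trans, prior: ordinary Markov transition and initial-state probabilities.
    T, S = class_probs.shape
    delta = np.log(prior) + np.log(class_probs[0])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(trans)        # S x S predecessor scores
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(class_probs[t])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                      # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy phrase identification: states 0 = Outside, 1 = Begin, 2 = Inside.
prior = np.array([0.6, 0.4, 1e-6])
trans = np.array([[0.7, 0.3, 1e-6],     # O never jumps straight to I
                  [0.2, 0.1, 0.7],
                  [0.3, 0.2, 0.5]])
class_probs = np.array([[0.8, 0.1, 0.1],    # per-token classifier outputs
                        [0.2, 0.7, 0.1],
                        [0.1, 0.2, 0.7],
                        [0.6, 0.2, 0.2]])
print(viterbi_with_classifier(class_probs, trans, prior))   # [0, 1, 2, 0]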


The Unscented Particle Filter

Neural Information Processing Systems

In this paper, we propose a new particle filter based on sequential importance sampling. The algorithm uses a bank of unscented filters to obtain the importance proposal distribution. This proposal has two very "nice" properties. Firstly, it makes efficient use of the latest available information and, secondly, it can have heavy tails. As a result, we find that the algorithm substantially outperforms standard particle filtering and other nonlinear filtering methods.
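
A compact sketch of the filter's structure on a standard scalar benchmark model: each particle runs one unscented-Kalman-filter step to build a Gaussian proposal that already incorporates the latest observation, a new particle is drawn from that proposal, and the importance weight corrects for the mismatch with the transition prior and likelihood. The scalar model, parameter choices, and the resample-every-step simplification are ours, not the paper's exact setup.

import numpy as np

rng = np.random.default_rng(0)
Q, R = 1.0, 1.0                                    # process / observation noise
f = lambda x, t: 0.5 * x + 25 * x / (1 + x ** 2) + 8 * np.cos(1.2 * t)
h = lambda x: x ** 2 / 20.0

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

def ukf_step(m, P, y, t, alpha=1.0, beta=0.0, kappa=2.0):
    # One scalar unscented Kalman filter step; returns the proposal (mean, var).
    n = 1
    lam = alpha ** 2 * (n + kappa) - n
    c = n + lam
    wm = np.array([lam / c, 0.5 / c, 0.5 / c])
    wc = wm.copy()
    wc[0] += 1 - alpha ** 2 + beta
    s = np.array([m, m + np.sqrt(c * P), m - np.sqrt(c * P)])     # sigma points
    sf = f(s, t)                                                  # predict
    m_pred = wm @ sf
    P_pred = wc @ (sf - m_pred) ** 2 + Q
    s2 = np.array([m_pred, m_pred + np.sqrt(c * P_pred), m_pred - np.sqrt(c * P_pred)])
    sy = h(s2)                                                    # update
    y_pred = wm @ sy
    Pyy = wc @ (sy - y_pred) ** 2 + R
    Pxy = wc @ ((s2 - m_pred) * (sy - y_pred))
    K = Pxy / Pyy
    return m_pred + K * (y - y_pred), max(P_pred - K * Pyy * K, 1e-6)

def unscented_particle_filter(ys, n_particles=200):
    xs = rng.normal(0.0, 1.0, n_particles)         # particle states
    Ps = np.full(n_particles, 1.0)                 # per-particle UKF variances
    estimates = []
    for t, y in enumerate(ys, start=1):
        new_xs, w = np.empty(n_particles), np.empty(n_particles)
        for i in range(n_particles):
            m_i, P_i = ukf_step(xs[i], Ps[i], y, t)   # UKF proposal uses latest y
            new_xs[i] = rng.normal(m_i, np.sqrt(P_i))
            # Importance weight: likelihood * transition prior / proposal.
            w[i] = (gauss(y, h(new_xs[i]), R) * gauss(new_xs[i], f(xs[i], t), Q)
                    / gauss(new_xs[i], m_i, P_i))
            Ps[i] = P_i
        w /= w.sum()
        estimates.append(w @ new_xs)
        idx = rng.choice(n_particles, size=n_particles, p=w)   # resample
        xs, Ps = new_xs[idx], Ps[idx]
    return np.array(estimates)

# Simulate the model and run the filter.
x, ys, truth = 0.1, [], []
for t in range(1, 31):
    x = f(x, t) + rng.normal(0, np.sqrt(Q))
    truth.append(x)
    ys.append(h(x) + rng.normal(0, np.sqrt(R)))
est = unscented_particle_filter(np.array(ys))
print(np.mean((est - np.array(truth)) ** 2))       # mean-squared tracking error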