Goto

Collaborating Authors

 Technology


Data Clustering by Markovian Relaxation and the Information Bottleneck Method

Neural Information Processing Systems

We introduce a new, nonparametric and principled, distance based clustering method. This method combines a pairwise based approach witha vector-quantization method which provide a meaningful interpretation to the resulting clusters. The idea is based on turning the distance matrix into a Markov process and then examine the decay of mutual-information during the relaxation of this process. The clusters emerge as quasi-stable structures during thisrelaxation, and then are extracted using the information bottleneck method.


Sparse Kernel Principal Component Analysis

Neural Information Processing Systems

'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisationof the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation intoa feature space wherein standard PCA is performed. Unfortunately, thetechnique is not'sparse', since the components thus obtained are expressed in terms of kernels associated with every trainingvector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors,using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1 Introduction Principal component analysis (PCA) is a well-established technique for dimensionality reduction,and examples of its many applications include data compression, image processing, visualisation, exploratory data analysis, pattern recognition and time series prediction.


Kernel Expansions with Unlabeled Examples

Neural Information Processing Systems

Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance.We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trainedvia the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulationof the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requiresvery few labeled examples for high classification accuracy.


A Mathematical Programming Approach to the Kernel Fisher Algorithm

Neural Information Processing Systems

We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD).A mathematical programming formulation based on the observation thatKFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. We find that both, KFD and the proposed sparse KFD, can be understood in an unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm.


Active Support Vector Machine Classification

Neural Information Processing Systems

Classificationis achieved by a linear or nonlinear separating surface in the input space of the dataset. In this work we propose a very fast simple algorithm, based on an active set strategy for solving quadratic programs with bounds [18]. The algorithm is capable of accurately solving problems with millions of points and requires nothing more complicated than a commonly available linear equation solver [17, 1, 6] for a typically small (100) dimensional input space of the problem. Key to our approach are the following two changes to the standard linear SVM: 1. Maximize the margin (distance) between the parallel separating planes with respect to both orientation (w) as well as location relative to the origin b).


Constrained Independent Component Analysis

Neural Information Processing Systems

The paper presents a novel technique of constrained independent component analysis (CICA) to introduce constraints into the classical ICAand solve the constrained optimization problem by using Lagrange multiplier methods. This paper shows that CICA can be used to order the resulted independent components in a specific manner and normalize the demixing matrix in the signal separation procedure. It can systematically eliminate the ICA's indeterminacy on permutation and dilation. The experiments demonstrate the use of CICA in ordering of independent components while providing normalized demixing processes. Keywords: Independent component analysis, constrained independent componentanalysis, constrained optimization, Lagrange multiplier methods 1 Introduction Independent component analysis (ICA) is a technique to transform a multivariate randomsignal into a signal with components that are mutually independent in complete statistical sense [1]. There has been a growing interest in research for efficient realization of ICA neural networks (ICNNs).


Text Classification using String Kernels

Neural Information Processing Systems

HUlna Lodhi John Shawe-Taylor N ello Cristianini Chris Watkins Department of Computer Science Royal Holloway, University of London Egham, Surrey TW20 OEX, UK {huma, john, nello, chrisw}Cdcs.rhbnc.ac.uk Abstract We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence ofk characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation ofthis feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique.


Large Scale Bayes Point Machines

Neural Information Processing Systems

The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently beendemonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisationabilities. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simpleand is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digitswhich show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine.


`N-Body' Problems in Statistical Learning

Neural Information Processing Systems

We present efficient algorithms for all-point-pairs problems, or'Nbody'-like problems,which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection, and the two-point correlation.


Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks

Neural Information Processing Systems

Forexample, in medical diagnosis, the presence of a symptom can be expressed as a noisy-OR of the diseases that may cause the symptom - on some occasions, a disease may fail to activate the symptom. Inference in richly-connected noisy-OR networks is intractable, butapproximate methods (e .g., variational techniques) are showing increasing promise as practical solutions. One problem withmost approximations is that they tend to concentrate on a relatively small number of modes in the true posterior, ignoring otherplausible configurations of the hidden variables. We introduce a new sequential variational method for bipartite noisy OR networks, that favors including all modes of the true posterior and models the posterior distribution as a tree. We compare this method with other approximations using an ensemble of networks with network statistics that are comparable to the QMR-DT medical diagnosticnetwork. 1 Inclusive variational approximations Approximate algorithms for probabilistic inference are gaining in popularity and are now even being incorporated into VLSI hardware (T.