Goto

Collaborating Authors

 Europe


Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Neural Information Processing Systems

We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels.


Incremental Algorithms for Hierarchical Classification

Neural Information Processing Systems

We study the problem of hierarchical classification when labels corresponding to partial and/or multiple paths in the underlying taxonomy are allowed. We introduce a new hierarchical loss function, the H-loss, implementing the simple intuition that additional mistakes in the subtree of a mistaken class should not be charged for. Based on a probabilistic data model introduced in earlier work, we derive the Bayes-optimal classifier for the H-loss. We then empirically compare two incremental approximations of the Bayes-optimal classifier with a flat SVM classifier and with classifiers obtained by using hierarchical versions of the Perceptron and SVM algorithms. The experiments show that our simplest incremental approximation of the Bayes-optimal classifier performs, after just one training epoch, nearly as well as the hierarchical SVM classifier (which performs best). For the same incremental algorithm we also derive an H-loss bound showing, when data are generated by our probabilistic data model, exponentially fast convergence to the H-loss of the hierarchical classifier based on the true model parameters.


Proximity Graphs for Clustering and Manifold Learning

Neural Information Processing Systems

Many machine learning algorithms for clustering or dimensionality reduction take as input a cloud of points in Euclidean space, and construct a graph with the input data points as vertices. This graph is then partitioned (clustering) or used to redefine metric information (dimensionality reduction). There has been much recent work on new methods for graph-based clustering and dimensionality reduction, but not much on constructing the graph itself. Graphs typically used include the fullyconnected graph, a local fixed-grid graph (for image segmentation) or a nearest-neighbor graph. We suggest that the graph should adapt locally to the structure of the data. This can be achieved by a graph ensemble that combines multiple minimum spanning trees, each fit to a perturbed version of the data set. We show that such a graph ensemble usually produces a better representation of the data manifold than standard methods; and that it provides robustness to a subsequent clustering or dimensionality reduction algorithm based on the graph.


Dependent Gaussian Processes

Neural Information Processing Systems

Gaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs.


Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity

Neural Information Processing Systems

Experimental studies have observed synaptic potentiation when a presynaptic neuron fires shortly before a postsynaptic neuron, and synaptic depression when the presynaptic neuron fires shortly after. The dependence of synaptic modulation on the precise timing of the two action potentials is known as spike-timing dependent plasticity or STDP. We derive STDP from a simple computational principle: synapses adapt so as to minimize the postsynaptic neuron's variability to a given presynaptic input, causing the neuron's output to become more reliable in the face of noise. Using an entropy-minimization objective function and the biophysically realistic spike-response model of Gerstner (2001), we simulate neurophysiological experiments and obtain the characteristic STDP curve along with other phenomena including the reduction in synaptic plasticity as synaptic efficacy increases. We compare our account to other efforts to derive STDP from computational principles, and argue that our account provides the most comprehensive coverage of the phenomena. Thus, reliability of neural response in the face of noise may be a key goal of cortical adaptation.


Nonlinear Blind Source Separation by Integrating Independent Component Analysis and Slow Feature Analysis

Neural Information Processing Systems

In contrast to the equivalence of linear blind source separation and linear independent component analysis it is not possible to recover the original source signal from some unknown nonlinear transformations of the sources using only the independence assumption. Integrating the objectives of statistical independence and temporal slowness removes this indeterminacy leading to a new method for nonlinear blind source separation. The principle of temporal slowness is adopted from slow feature analysis, an unsupervised method to extract slowly varying features from a given observed vectorial signal. The performance of the algorithm is demonstrated on nonlinearly mixed speech data.


Exponentiated Gradient Algorithms for Large-margin Structured Classification

Neural Information Processing Systems

We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient--even in cases where the number of labels y is exponential in size--provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs.


Large-Scale Prediction of Disulphide Bond Connectivity

Neural Information Processing Systems

The formation of disulphide bridges among cysteines is an important feature of protein structures. Here we develop new methods for the prediction of disulphide bond connectivity. We first build a large curated data set of proteins containing disulphide bridges and then use 2-Dimensional Recursive Neural Networks to predict bonding probabilities between cysteine pairs. These probabilities in turn lead to a weighted graph matching problem that can be addressed efficiently. We show how the method consistently achieves better results than previous approaches on the same validation data. In addition, the method can easily cope with chains with arbitrary numbers of bonded cysteines. Therefore, it overcomes one of the major limitations of previous approaches restricting predictions to chains containing no more than 10 oxidized cysteines. The method can be applied both to situations where the bonded state of each cysteine is known or unknown, in which case bonded state can be predicted with 85% precision and 90% recall. The method also yields an estimate for the total number of disulphide bridges in each chain.


Co-Training and Expansion: Towards Bridging Theory and Practice

Neural Information Processing Systems

Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker "expansion" assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAClearning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs.


Breaking SVM Complexity with Cross-Training

Neural Information Processing Systems

We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages.