Data Science
Using Tarjan's Red Rule for Fast Dependency Tree Construction
We focus on the problem of efficient learning of dependency trees. It is well-known that given the pairwise mutual information coefficients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construction of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we maintain is a probabilistic confidence interval on the coefficients, which we derive by examining just a small sample of the data. The algorithm is able to flag the need to shrink an interval, which translates to inspection of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, without significant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan's red-edge rule, which is generally considered a guaranteed recipe for bad performance.
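For orientation, the classical full-data baseline that this abstract starts from can be sketched as follows. This is not the authors' interval-based algorithm: it computes every pairwise mutual information from all records and then runs Kruskal's algorithm on the negated weights. The function names `mutual_information` and `dependency_tree` are hypothetical, and the attributes are assumed to be discrete.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete columns."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written in terms of raw counts.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def dependency_tree(columns):
    """Baseline tree: Kruskal's algorithm on negated mutual information.

    `columns` is a list of equal-length sequences, one per attribute.
    Returns the tree as a list of (i, j) attribute-index pairs.
    """
    d = len(columns)
    # Negating the weights turns "maximise total mutual information"
    # into a minimum-weight spanning tree problem.
    edges = sorted(
        (-mutual_information(columns[i], columns[j]), i, j)
        for i, j in combinations(range(d), 2)
    )
    parent = list(range(d))

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it is kept
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

The quadratic number of `mutual_information` calls, each scanning every record, is the cost that the abstract identifies as dominant for large data sets.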
Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
Vert, Jean-Philippe, Kanehisa, Minoru
We present an algorithm to extract features from high-dimensional gene expression profiles, based on the knowledge of a graph which links together genes known to participate in successive reactions in metabolic pathways. Motivated by the intuition that biologically relevant features are likely to exhibit smoothness with respect to the graph topology, the algorithm involves encoding the graph and the set of expression profiles into kernel functions, and performing a generalized form of canonical correlation analysis in the corresponding reproducing kernel Hilbert spaces. Function prediction experiments for the genes of the yeast S. cerevisiae validate this approach by showing a consistent increase in performance when a state-of-the-art classifier uses the vector of features instead of the original expression profile to predict the functional class of a gene.
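As an illustration of how a graph can be encoded into a kernel function, the sketch below computes a diffusion kernel in its usual form, the matrix exponential of the negated graph Laplacian. The abstract does not give the exact construction or parameter values, so the function name `diffusion_kernel`, the parameter `beta`, and the use of an unnormalised Laplacian are assumptions made for this example.

```python
import numpy as np


def diffusion_kernel(adjacency, beta=1.0):
    """Diffusion kernel K = exp(-beta * L) for an undirected graph.

    `adjacency` is a symmetric 0/1 (or weighted) matrix; L = D - A is the
    unnormalised graph Laplacian. `beta` (assumed here) sets how far
    similarity diffuses along the edges.
    """
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    # L is symmetric, so exp(-beta * L) follows from its eigendecomposition.
    w, V = np.linalg.eigh(L)
    return (V * np.exp(-beta * w)) @ V.T


# Path graph 0 - 1 - 2: nodes closer in the graph get larger kernel values.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
K = diffusion_kernel(A, beta=0.5)
```

The resulting matrix is symmetric and positive definite, so it can serve as a kernel over genes; functions with a small norm in the associated space vary smoothly over the graph.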
Going Metric: Denoising Pairwise Data
Roth, Volker, Laub, Julian, Müller, Klaus-Robert, Buhmann, Joachim M.
Pairwise data in empirical sciences typically violate metricity, either due to noise or due to fallible estimates, and therefore are hard to analyze by conventional machine learning technology. In this paper we therefore study ways to work around this problem. First, we present an alternative embedding to multidimensional scaling (MDS) that allows us to apply a variety of classical machine learning and signal processing algorithms. The class of pairwise grouping algorithms which share the shift-invariance property is statistically invariant under this embedding procedure, leading to identical assignments of objects to clusters. Based on this new vectorial representation, denoising methods are applied in a second step. Both steps provide a theoretically well controlled setup to translate from pairwise data to the respective denoised metric representation. We demonstrate the practical usefulness of our theoretical reasoning by discovering structure in protein sequence databases, visibly improving performance upon existing automatic methods.
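One way an embedding of this kind can be realised is sketched below. The abstract does not spell out the construction, so the details are assumptions: the dissimilarity matrix is centred as in classical MDS, its most negative eigenvalue is shifted away, and coordinates are read off the eigendecomposition. The function name `constant_shift_embedding` is hypothetical. The shift adds the same constant to every off-diagonal dissimilarity, which is why a shift-invariant grouping criterion would be unaffected by it.

```python
import numpy as np


def constant_shift_embedding(D):
    """Embed a symmetric, zero-diagonal dissimilarity matrix D in a vector space.

    Returns (X, D_shifted): the rows of X are points whose squared Euclidean
    distances equal D_shifted, i.e. D plus one constant off the diagonal.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n      # centring projection
    Sc = -0.5 * Q @ D @ Q                    # centred similarity matrix
    Sc = 0.5 * (Sc + Sc.T)                   # guard against round-off asymmetry
    lam_min = np.linalg.eigvalsh(Sc)[0]      # smallest eigenvalue (never positive)
    shift = max(0.0, -lam_min)
    S_shifted = Sc + shift * np.eye(n)       # positive semidefinite after the shift
    D_shifted = D + 2.0 * shift * (1.0 - np.eye(n))
    w, V = np.linalg.eigh(S_shifted)
    X = V * np.sqrt(np.clip(w, 0.0, None))   # coordinates, one row per object
    return X, D_shifted
```

Denoising in the sense of the abstract would then operate on `X`, for example by keeping only the leading coordinates; that step is not shown here.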
Artefactual Structure from Least-Squares Multidimensional Scaling
Hughes, Nicholas P., Lowe, David
We consider the problem of illusory or artefactual structure from the visualisation of high-dimensional structureless data. In particular we examine the role of the distance metric in the use of topographic mappings based on the statistical field of multidimensional scaling. We show that the use of a squared Euclidean metric (i.e. the SSTRESS measure) gives rise to an annular structure when the input data is drawn from a high-dimensional isotropic distribution, and we provide a theoretical justification for this observation.
Fast Transformation-Invariant Factor Analysis
Kannan, Anitha, Jojic, Nebojsa, Frey, Brendan
Dimensionality reduction techniques such as principal component analysis and factor analysis are used to discover a linear mapping between high-dimensional data samples and points in a lower-dimensional subspace. In [6], Jojic and Frey introduced the mixture of transformation-invariant component analyzers (MTCA), which can account for global transformations such as translations and rotations, perform clustering and learn local appearance deformations by dimensionality reduction.
Robust Novelty Detection with Single-Class MPM
Ghaoui, Laurent E., Jordan, Michael I., Lanckriet, Gert R.
This algorithm, the "single-class minimax probability machine" (MPM), is built on a distribution-free methodology that minimizes the worst-case probability of a data point falling outside of a convex set, given only the mean and covariance matrix of the distribution and making no further distributional assumptions. We present a robust approach to estimating the mean and covariance matrix within the general two-class MPM setting, and show how this approach specializes to the single-class problem. We provide empirical results comparing the single-class MPM to the single-class SVM and a two-class SVM method. 1 Introduction Novelty detection is an important unsupervised learning problem in which test data are to be judged as having been generated from the same or a different process as that which generated the training data.
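To make the idea of a worst-case probability concrete, the sketch below evaluates it for the simplest convex set, a half-space {x : a·x ≥ b}. Knowing only the mean and covariance, the one-sided Chebyshev (Cantelli) inequality bounds the probability of falling outside the half-space, and the bound is tight over all distributions with those two moments. The restriction to a half-space and the function name `worst_case_outside_prob` are assumptions made for illustration; this is not the authors' full optimisation procedure.

```python
import numpy as np


def worst_case_outside_prob(a, b, mean, cov):
    """Worst-case P(a.x <= b) over all distributions with given mean and covariance.

    Uses the one-sided Chebyshev (Cantelli) bound s2 / (s2 + gap**2), where
    s2 = a' cov a is the variance of a.x and gap = a' mean - b is the margin
    of the mean inside the half-space.
    """
    a = np.asarray(a, dtype=float)
    m = float(a @ np.asarray(mean, dtype=float))
    s2 = float(a @ np.asarray(cov, dtype=float) @ a)
    if m <= b:
        # The mean is not strictly inside the half-space: no guarantee is possible.
        return 1.0
    return s2 / (s2 + (m - b) ** 2)


# Mean two standard deviations inside the half-space x1 >= 0.
p = worst_case_outside_prob([1.0, 0.0], 0.0, [2.0, 0.0], np.eye(2))
```

Here `p` is 0.2: whatever the distribution, at most one fifth of the mass can lie outside the half-space. Minimising this quantity over `a` and `b` is the kind of problem the abstract describes.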