Statistical Learning
Clustering by compression
Cilibrasi, Rudi, Vitanyi, Paul
We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal but uses the non-computable notion of Kolmogorov complexity. We propose precise notions of similarity metric, normal compressor, and show that the NCD based on a normal compressor is a similarity metric that approximates universality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
Metric learning pairwise kernel for graph inference
Vert, Jean-Philippe, Qiu, Jian, Noble, William Stafford
Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). Here, we distinguish between two modes of inference in this setting: direct inference based upon similarities between nodes joined by an edge, and indirect inference based upon similarities between one pair of nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the metric learning pairwise kernel (MLPK). We demonstrate, using several real biological networks, that this direct approach often improves upon the state-of-the-art SVM for indirect inference with the tensor product pairwise kernel.
A Novel Bayesian Classifier using Copula Functions
Pattern classification is an important task in several image processing, statistical learning, and data mining applications. The most popular pattern classifiers are Bayesian classifiers. There are many well known methods for represent ing Bayesian classifiers, but one of the most useful method is by discriminant functions . These functions provide inter-class decision surfaces for Bayesian classifier s. Discriminant functions assume several forms depending on the probability density of the feature space. But most attention has been received by discriminant functions that assume multivariate Gaussian distribution [1].
A Unified View of TD Algorithms; Introducing Full-Gradient TD and Equi-Gradient Descent TD
This paper addresses the issue of policy evaluation in Markov Decision Processes, using linear function approximation. It provides a unified view of algorithms such as TD(lambda), LSTD(lambda), iLSTD, residual-gradient TD. It is asserted that they all consist in minimizing a gradient function and differ by the form of this function and their means of minimizing it. Two new schemes are introduced in that framework: Full-gradient TD which uses a generalization of the principle introduced in iLSTD, and EGD TD, which reduces the gradient by successive equi-gradient descents. These three algorithms form a new intermediate family with the interesting property of making much better use of the samples than TD while keeping a gradient descent scheme, which is useful for complexity issues and optimistic policy iteration.
Nonlinear Estimators and Tail Bounds for Dimension Reduction in $l_1$ Using Cauchy Random Projections
Li, Ping, Hastie, Trevor J., Church, Kenneth W.
For dimension reduction in $l_1$, the method of {\em Cauchy random projections} multiplies the original data matrix $\mathbf{A} \in\mathbb{R}^{n\times D}$ with a random matrix $\mathbf{R} \in \mathbb{R}^{D\times k}$ ($k\ll\min(n,D)$) whose entries are i.i.d. samples of the standard Cauchy C(0,1). Because of the impossibility results, one can not hope to recover the pairwise $l_1$ distances in $\mathbf{A}$ from $\mathbf{B} = \mathbf{AR} \in \mathbb{R}^{n\times k}$, using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computation, information retrieval, learning, and data mining. We propose three types of nonlinear estimators: the bias-corrected sample median estimator, the bias-corrected geometric mean estimator, and the bias-corrected maximum likelihood estimator. The sample median estimator and the geometric mean estimator are asymptotically (as $k\to \infty$) equivalent but the latter is more accurate at small $k$. We derive explicit tail bounds for the geometric mean estimator and establish an analog of the Johnson-Lindenstrauss (JL) lemma for dimension reduction in $l_1$, which is weaker than the classical JL lemma for dimension reduction in $l_2$. Asymptotically, both the sample median estimator and the geometric mean estimators are about 80% efficient compared to the maximum likelihood estimator (MLE). We analyze the moments of the MLE and propose approximating the distribution of the MLE by an inverse Gaussian.
Farthest-Point Heuristic based Initialization Methods for K-Modes Clustering
The k -modes algorithm [1] extends the k -means paradigm to cluster categorical data by using (1) a simple matching dissimilarity measure for categorical objects, (2) modes instead of means for clusters, and (3) a frequency-based method to update modes in the k -means fashion to minimize the cost function of clustering. Because the k -modes algorithm uses the same clustering process as k -means, it preserves the efficiency of the k -means algorithm. Although the k -modes algorithm is very efficient, it suffers the problem that the clustering results are sensitive to the selection of the initial points. Hence, a better initial points selection procedure would improve the reliability and accuracy of clustering results. To that end, an iterative initial-points refinement algorithm for k -modes clustering has been presented in [2]. As shown in [2], the new initialization pr ocedure greatly improves the reliability and accuracy of final clustering results. Despite the su ccess of Ref. [2], the following observations motivate us to further pursue other alternative initialization methods.
Modular self-organization
This paper addresses the problem of building a long-living a utonomous agent; by long-living, we mean that this agent has a large number of relatively complex and varying tasks to perform. Biology sugge sts some ideas about the way animals deal with a variety of tasks: brains are made of specialized and complementary areas/modules; skills are spre ad over modules. On the one hand, distributing functions and representation s has immediate advantages: parallel processing implies reaction speed-u p; a relative independence between modules gives more robustness. Both prope rties might clearly increase the agent's efficiency. On the other hand, th e fact of distributing a system raises a fundamental issue: how does the o rganization process of the modules happen during the life-time? 1 There has been much research about the design of modular inte lligent architectures (see for instance [15] [5] [1] [7]). It is neve rtheless very often the (human) designer who decides the way modules are connect ed to each other and how they behave with respect to the others.
Building and displaying name relations using automatic unsupervised analysis of newspaper articles
Pouliquen, Bruno, Steinberger, Ralf, Ignat, Camelia, Oellinger, Tamara
We present a tool that, from automatically recognised names, tries to infer inter-person relations in order to present associated people on maps. Based on an in-house Named Entity Recognition tool, applied on clusters of an average of 15,000 news articles per day, in 15 different languages, we build a knowledge base that allows extracting statistical co-occurrences of persons and visualising them on a per-person page or in various graphs.
Navigating multilingual news collections using automatically extracted information
Steinberger, Ralf, Pouliquen, Bruno, Ignat, Camelia
We are presenting a text analysis tool set that allows analysts in various fields to sieve through large collections of multilingual news items quickly and to find information that is of relevance to them. For a given document collection, the tool set automatically clusters the texts into groups of similar articles, extracts names of places, people and organisations, lists the user-defined specialist terms found, links clusters and entities, and generates hyperlinks. Through its daily news analysis operating on thousands of articles per day, the tool also learns relationships between people and other entities. The fully functional prototype system allows users to explore and navigate multilingual document collections across languages and time.
Metric entropy in competitive on-line prediction
A typical result of competitive on-line prediction says that, for a giv en benchmark class of prediction strategies, there is a prediction strategy that performs almost as well as the best prediction strategies in the benchmark cla ss. For simplicity, in this paper the performance of a prediction strategy will be measured by the cumulative squared distance between its predictions a nd the true observations, assumed to be real (occasionally complex) numbers . Different methods of competitive on-line predictions (such as Gradient Desce nt, following the perturbed leader, strong and weak aggregating algorithms, d efensive forecasting, etc.) tend to have their narrow "area of expertise": eac h works well for benchmark classes of a specific "size" but is not readily applicable to c lasses of a different size. In this paper we will apply a simple general method based on metric ent ropy to benchmark classes of a wide range of sizes. Typically, this method does not give optimal results, but its results are often not much worse than those given by specialized methods, especially for benchmark classes that are not too massive.