Statistical Learning
Clustering of discretely observed diffusion processes
De Gregorio, Alessandro, Iacus, Stefano Maria
In this paper a new dissimilarity measure to identify groups of assets dynamics is proposed. The underlying generating process is assumed to be a diffusion process solution of stochastic differential equations and observed at discrete time. The mesh of observations is not required to shrink to zero. As distance between two observed paths, the quadratic distance of the corresponding estimated Markov operators is considered. Analysis of both synthetic data and real financial data from NYSE/NASDAQ stocks, give evidence that this distance seems capable to catch differences in both the drift and diffusion coefficients contrary to other commonly used metrics.
Finding rare objects and building pure samples: Probabilistic quasar classification from low resolution Gaia spectra
Bailer-Jones, C. A. L., Smith, K. W., Tiede, C., Sordo, R., Vallenari, A.
We develop and demonstrate a probabilistic method for classifying rare objects in surveys with the particular goal of building very pure samples. It works by modifying the output probabilities from a classifier so as to accommodate our expectation (priors) concerning the relative frequencies of different classes of objects. We demonstrate our method using the Discrete Source Classifier, a supervised classifier currently based on Support Vector Machines, which we are developing in preparation for the Gaia data analysis. DSC classifies objects using their very low resolution optical spectra. We look in detail at the problem of quasar classification, because identification of a pure quasar sample is necessary to define the Gaia astrometric reference frame. By varying a posterior probability threshold in DSC we can trade off sample completeness and contamination. We show, using our simulated data, that it is possible to achieve a pure sample of quasars (upper limit on contamination of 1 in 40,000) with a completeness of 65% at magnitudes of G=18.5, and 50% at G=20.0, even when quasars have a frequency of only 1 in every 2000 objects. The star sample completeness is simultaneously 99% with a contamination of 0.7%. Including parallax and proper motion in the classifier barely changes the results. We further show that not accounting for class priors in the target population leads to serious misclassifications and poor predictions for sample completeness and contamination. (Truncated)
Finding links and initiators: a graph reconstruction problem
Mannila, Heikki, Terzi, Evimaria
Analyzing 0-1 matrices is one of the main themes in data mining. Techniques such as clustering or mixture modelling, matrix decomposition techniques such as PCA, ICA, and NMR, and Bayesian all aim to give an answer to the informal question: "Where does the matrix come from?" These approaches aim at describing a probabilistic generative model that describes the observed matrix well. In this paper we consider yet another way of answering the question "Where does a 0-1 matrix M come from?" In our model, the matrix M of size n m is considered to arise from initiators, certain few entries that are initially 1. The initiators propagate their 1's by following the links of a directed influence graph G (represented by an n n adjacency matrix). We denote the initiator matrix of size n m by N and we use G (of size n n) to refer both to the directed graph between the rows of M and as well as its adjacency matrix. Then, we believe that the structure of N and G can tell how a matrix M has been created.
Collective Classification in Network Data
Sen, Prithviraj (University of Maryland) | Namata, Galileo (University of Maryland) | Bilgic, Mustafa (University of Maryland) | Getoor, Lise (University of Maryland) | Galligher, Brian (University of Maryland) | Eliassi-Rad, Tina (University of Maryland)
Many real-world applications produce networked data such as the world-wide web (hypertext documents connected via hyperlinks), social networks (for example, people connected by friendship links), communication networks (computers connected via communication links) and biological networks (for example, protein interaction networks). A recent focus in machine learning research has been to extend traditional machine learning classification techniques to classify nodes in such networks. In this article, we provide a brief introduction to this area of research and how it has progressed during the past decade. We introduce four of the most widely used inference algorithms for classifying networked data and empirically compare them on both synthetic and real-world data.
Normalized Information Distance
Vitanyi, Paul M. B., Balbach, Frank J., Cilibrasi, Rudi L., Li, Ming
The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, expecially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.
Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the l1-norm or the block l1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
Predictive Hypothesis Identification
While statistics focusses on hypothesis testing and on estimating (properties of) the true sampling distribution, in machine learning the performance of learning algorithms on future data is the primary issue. In this paper we bridge the gap with a general principle (PHI) that identifies hypotheses with best predictive performance. This includes predictive point and interval estimation, simple and composite hypothesis testing, (mixture) model selection, and others as special cases. For concrete instantiations we will recover well-known methods, variations thereof, and new ones. PHI nicely justifies, reconciles, and blends (a reparametrization invariant variation of) MAP, ML, MDL, and moment estimation. One particular feature of PHI is that it can genuinely deal with nested hypotheses.
From Data to the p-Adic or Ultrametric Model
We model anomaly and change in data by embedding the data in an ultrametric space. Taking our initial data as cross-tabulation counts (or other input data formats), Correspondence Analysis allows us to endow the information space with a Euclidean metric. We then model anomaly or change by an induced ultrametric. The induced ultrametric that we are particularly interested in takes a sequential - e.g. temporal - ordering of the data into account. We apply this work to the flow of narrative expressed in the film script of the Casablanca movie; and to the evolution between 1988 and 2004 of the Colombian social conflict and violence.
The Correspondence Analysis Platform for Uncovering Deep Structure in Data and Information
We study two aspects of information semantics: (i) the collection of all relationships, (ii) tracking and spotting anomaly and change. The first is implemented by endowing all relevant information spaces with a Euclidean metric in a common projected space. The second is modelled by an induced ultrametric. A very general way to achieve a Euclidean embedding of different information spaces based on cross-tabulation counts (and from other input data formats) is provided by Correspondence Analysis. From there, the induced ultrametric that we are particularly interested in takes a sequential - e.g. temporal - ordering of the data into account. We employ such a perspective to look at narrative, "the flow of thought and the flow of language" (Chafe). In application to policy decision making, we show how we can focus analysis in a small number of dimensions.
Uncertainty quantification in complex systems using approximate solvers
Koutsourelakis, Phaedon-Stelios
This paper proposes a novel uncertainty quantification framework for computationally demanding systems characterized by a large vector of non-Gaussian uncertainties. It combines state-of-the-art techniques in advanced Monte Carlo sampling with Bayesian formulations. The key departure from existing works is the use of inexpensive, approximate computational models in a rigorous manner. Such models can readily be derived by coarsening the discretization size in the solution of the governing PDEs, increasing the time step when integration of ODEs is performed, using fewer iterations if a non-linear solver is employed or making use of lower order models. It is shown that even in cases where the inexact models provide very poor approximations of the exact response, statistics of the latter can be quantified accurately with significant reductions in the computational effort. Multiple approximate models can be used and rigorous confidence bounds of the estimates produced are provided at all stages.