Statistical Learning
Compressive Network Analysis
Jiang, Xiaoye, Yao, Yuan, Liu, Han, Guibas, Leonidas
Modern data acquisition routinely produces massive amounts of network data. Though many methods and models have been proposed to analyze such data, the research of network data is largely disconnected with the classical theory of statistical learning and signal processing. In this paper, we present a new framework for modeling network data, which connects two seemingly different areas: network data analysis and compressed sensing. From a nonparametric perspective, we model an observed network using a large dictionary. In particular, we consider the network clique detection problem and show connections between our formulation with a new algebraic tool, namely Randon basis pursuit in homogeneous spaces. Such a connection allows us to identify rigorous recovery conditions for clique detection problems. Though this paper is mainly conceptual, we also develop practical approximation algorithms for solving empirical problems and demonstrate their usefulness on real-world datasets.
Margin-adaptive model selection in statistical learning
Arlot, Sylvain, Bartlett, Peter L.
A classical condition for fast learning rates is the margin condition, first introduced by Mammen and Tsybakov. We tackle in this paper the problem of adaptivity to this condition in the context of model selection, in a general learning framework. Actually, we consider a weaker version of this condition that allows one to take into account that learning within a small model can be much easier than within a large one. Requiring this "strong margin adaptivity" makes the model selection problem more challenging. We first prove, in a general framework, that some penalization procedures (including local Rademacher complexities) exhibit this adaptivity when the models are nested. Contrary to previous results, this holds with penalties that only depend on the data. Our second main result is that strong margin adaptivity is not always possible when the models are not nested: for every model selection procedure (even a randomized one), there is a problem for which it does not demonstrate strong margin adaptivity.
Robust Clustering Using Outlier-Sparsity Regularization
Forero, Pedro A., Kekatos, Vassilis, Giannakis, Georgios B.
Notwithstanding the popularity of conventional clustering algorithms such as K-means and probabilistic clustering, their clustering results are sensitive to the presence of outliers in the data. Even a few outliers can compromise the ability of these algorithms to identify meaningful hidden structures rendering their outcome unreliable. This paper develops robust clustering algorithms that not only aim to cluster the data, but also to identify the outliers. The novel approaches rely on the infrequent presence of outliers in the data which translates to sparsity in a judiciously chosen domain. Capitalizing on the sparsity in the outlier domain, outlier-aware robust K-means and probabilistic clustering approaches are proposed. Their novelty lies on identifying outliers while effecting sparsity in the outlier domain through carefully chosen regularization. A block coordinate descent approach is developed to obtain iterative algorithms with convergence guarantees and small excess computational complexity with respect to their non-robust counterparts. Kernelized versions of the robust clustering algorithms are also developed to efficiently handle high-dimensional data, identify nonlinearly separable clusters, or even cluster objects that are not represented by vectors. Numerical tests on both synthetic and real datasets validate the performance and applicability of the novel algorithms.
Fast redshift clustering with the Baire (ultra) metric
Murtagh, Fionn, Contreras, Pedro
If X is endowed with a metric, then this metric can be mapped onto an ultrametric. In practice, endowing X with a metric can be relaxed to a dissimilarity. An often used mapping from metric to ultrametric is by means of an agglomerative hierarchical clustering algorithm. A succession of n 1 pairwise merge steps take place by making use of the closest pair of singletons and/or clusters at each step. Here n is the number of observations, i.e. the cardinality of set X. Closeness between singletons is furnished by whatever distance or dissimilarity is in use. For closeness between singleton or non-singleton clusters, we need to define an inter-cluster distance or dissimilarity. This can be defined with reference to the cluster compactness or other property that we wish to optimize at each step of the algorithm. Since agglomerative hierarchical clustering requires consideration of pairwise dissimilarities at each stage it can be shown that even in the case of the most efficient algorithms, e.g.
Simultaneous model-based clustering and visualization in the Fisher discriminative subspace
Bouveyron, Charles, Brunet, Camille
Clustering in high-dimensional spaces is nowadays a recurrent problem in many scientific domains but remains a difficult task from both the clustering accuracy and the result understanding points of view. This paper presents a discriminative latent mixture (DLM) model which fits the data in a latent orthonormal discriminative subspace with an intrinsic dimension lower than the dimension of the original space. By constraining model parameters within and between groups, a family of 12 parsimonious DLM models is exhibited which allows to fit onto various situations. An estimation algorithm, called the Fisher-EM algorithm, is also proposed for estimating both the mixture parameters and the discriminative subspace. Experiments on simulated and real datasets show that the proposed approach performs better than existing clustering methods while providing a useful representation of the clustered data. The method is as well applied to the clustering of mass spectrometry data.
High-Dimensional Inference with the generalized Hopfield Model: Principal Component Analysis and Corrections
Cocco, Simona, Monasson, Remi, Sessak, Vitor
Understanding the patterns of correlations between the components of complex systems is a fundamental issue in various scientific fields, ranging from neurobiology to genomic, from finance to sociology,... A recurrent problem is to distinguish between direct correlations, produced by physiological or functional interactions between the components, and network correlations, which are mediated by other, third-party components. Various approaches have been proposed to infer interactions from correlations, exploiting concepts related to statistical dimensional reduction [1], causality [2], the maximum entropy principle [3], Markov random fields [4]... A major practical and theoretical difficulty in doing so is the paucity and the quality of data: reliable analysis should be able to unveil real patterns of interactions, even if measures are affected by under-or noisy sampling. The size of the interaction network can be comparable to or larger than the number of data, a situation referred to as highdimensional inference. The purpose of the present work is to establish a quantitative correspondence between two of those approaches, namely the inference of Boltzmann Machines (also called Ising model in statistical physics and undirected graphical models for discrete variables in statistical inference [4]) and Principal Component Analysis (PCA) [1]. Inverse Boltzmann Machines (BM) are a mathematically well-founded but computationally challenging approach to infer interactions from correlations.
An expert system for detecting automobile insurance fraud using social network analysis
Šubelj, Lovro, Furlan, Štefan, Bajec, Marko
The article proposes an expert system for detection, and subsequent investigation, of groups of collaborating automobile insurance fraudsters. The system is described and examined in great detail, several technical difficulties in detecting fraud are also considered, for it to be applicable in practice. Opposed to many other approaches, the system uses networks for representation of data. Networks are the most natural representation of such a relational domain, allowing formulation and analysis of complex relations between entities. Fraudulent entities are found by employing a novel assessment algorithm, Iterative Assessment Algorithm (IAA), also presented in the article. Besides intrinsic attributes of entities, the algorithm explores also the relations between entities. The prototype was evaluated and rigorously analyzed on real world data. Results show that automobile insurance fraud can be efficiently detected with the proposed system and that appropriate data representation is vital. Key words: Fraud detection, Automobile insurance, Social network analysis, Link analysis, Assessment propagation 1. Introduction Fraud is encountered in a variety of domains. It comes in all different shapes and sizes, from traditional fraud, e.g. Such groups can be found in the automobile insurance domain. Here fraudsters stage traffic accidents and issue fake insurance claims to gain (unjustified) funds from their general or vehicle insurance. There are also cases where an accident has never occurred, and the vehicles have only been placed onto the road. Still, the majority of such fraud is not planned (opportunistic fraud) - an individual only seizes the opportunity arising from the accident and issues exaggerated insurance claims or claims for past damages. Staged accidents have several common characteristics. They occur in late hours and non-urban areas in order to reduce the probability of witnesses. Drivers are usually younger males, there are many passengers in the vehicles, but never children or elders. The police is always called to the scene to make the subsequent acquisition of means easier. It is also not uncommon that all of the participants have multiple (serious) injuries, when there is almost no damage on the vehicles. Many other suspicious characteristics exist, not mentioned here.
Metamodel-based importance sampling for the simulation of rare events
Dubourg, V., Deheeger, F., Sudret, B.
In the field of structural reliability, the Monte-Carlo estimator is considered as the reference probability estimator. However, it is still untractable for real engineering cases since it requires a high number of runs of the model. In order to reduce the number of computer experiments, many other approaches known as reliability methods have been proposed. A certain approach consists in replacing the original experiment by a surrogate which is much faster to evaluate. Nevertheless, it is often difficult (or even impossible) to quantify the error made by this substitution. In this paper an alternative approach is developed. It takes advantage of the kriging meta-modeling and importance sampling techniques. The proposed alternative estimator is finally applied to a finite element based structural reliability analysis.
Bayesian inference for queueing networks and modeling of internet services
Sutton, Charles, Jordan, Michael I.
Modern Internet services, such as those at Google, Yahoo!, and Amazon, handle billions of requests per day on clusters of thousands of computers. Because these services operate under strict performance requirements, a statistical understanding of their performance is of great practical interest. Such services are modeled by networks of queues, where each queue models one of the computers in the system. A key challenge is that the data are incomplete, because recording detailed information about every request to a heavily used system can require unacceptable overhead. In this paper we develop a Bayesian perspective on queueing models in which the arrival and departure times that are not observed are treated as latent variables. Underlying this viewpoint is the observation that a queueing model defines a deterministic transformation between the data and a set of independent variables called the service times. With this viewpoint in hand, we sample from the posterior distribution over missing data and model parameters using Markov chain Monte Carlo. We evaluate our framework on data from a benchmark Web application. We also present a simple technique for selection among nested queueing models. We are unaware of any previous work that considers inference in networks of queues in the presence of missing data.
Slicing: Nonsingular Estimation of High Dimensional Covariance Matrices Using Multiway Kronecker Delta Covariance Structures
Nonsingular estimation of high dimensional covariance matrices is an important step in many statistical procedures like classification, clustering, variable selection an future extraction. After a review of the essential background material, this paper introduces a technique we call slicing for obtaining a nonsingular covariance matrix of high dimensional data. Slicing is essentially assuming that the data has Kronecker delta covariance structure. Finally, we discuss the implications of the results in this paper and provide an example of classification for high dimensional gene expression data.