Statistical Learning
Using Model-based Overlapping Seed Expansion to detect highly overlapping community structure
McDaid, Aaron F., Hurley, Neil J.
As research into community finding in social networks progresses, there is a need for algorithms capable of detecting overlapping community structure. Many algorithms have been proposed in recent years that are capable of assigning each node to more than a single community. The performance of these algorithms tends to degrade when the ground-truth contains a more highly overlapping community structure, with nodes assigned to more than two communities. Such highly overlapping structure is likely to exist in many social networks, such as Facebook friendship networks. In this paper we present a scalable algorithm, MOSES, based on a statistical model of community structure, which is capable of detecting highly overlapping community structure, especially when there is variance in the number of communities each node is in. In evaluation on synthetic data MOSES is found to be superior to existing algorithms, especially at high levels of overlap. We demonstrate MOSES on real social network data by analyzing the networks of friendship links between students of five US universities.
Supervised Random Walks: Predicting and Recommending Links in Social Networks
Predicting the occurrence of links is a fundamental problem in networks. In the link prediction problem we are given a snapshot of a network and would like to infer which interactions among existing members are likely to occur in the near future or which existing interactions we are missing. Although this problem has been extensively studied, the challenge of how to effectively combine the information from the network structure with rich node and edge attribute data remains largely open. We develop an algorithm based on Supervised Random Walks that naturally combines the information from the network structure with node and edge level attributes. We achieve this by using these attributes to guide a random walk on the graph. We formulate a supervised learning task where the goal is to learn a function that assigns strengths to edges in the network such that a random walker is more likely to visit the nodes to which new links will be created in the future. We develop an efficient training algorithm to directly learn the edge strength estimation function. Our experiments on the Facebook social graph and large collaboration networks show that our approach outperforms state-of-the-art unsupervised approaches as well as approaches that are based on feature extraction.
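The core idea, a random walk whose transition probabilities come from feature-dependent edge strengths, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the logistic strength function, the restart probability `alpha`, the toy graph, and every function name are assumptions, and the supervised step that learns the weight vector `w` is omitted.

```python
import numpy as np

def edge_strength(features, w):
    """Map edge feature vectors to positive strengths (logistic form is assumed)."""
    return 1.0 / (1.0 + np.exp(-(features @ w)))

def strength_matrix(n, edges, features, w):
    """Build a symmetric n x n matrix of edge strengths for an undirected graph."""
    A = np.zeros((n, n))
    strengths = edge_strength(features, w)
    for (u, v), s_uv in zip(edges, strengths):
        A[u, v] = A[v, u] = s_uv
    return A

def random_walk_with_restart(A, source, alpha=0.15, iters=500):
    """Approximate the stationary distribution of a walk that restarts at `source`."""
    row_sums = A.sum(axis=1, keepdims=True)
    P = A / np.where(row_sums == 0.0, 1.0, row_sums)  # row-stochastic transitions
    restart = np.zeros(A.shape[0])
    restart[source] = 1.0
    p = restart.copy()
    for _ in range(iters):
        p = (1.0 - alpha) * (p @ P) + alpha * restart
    return p

# Hypothetical toy graph: one scalar feature per edge.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
features = np.array([[2.0], [-2.0], [0.0], [0.0]])
w = np.array([1.0])  # stands in for learned weights

A = strength_matrix(4, edges, features, w)
scores = random_walk_with_restart(A, source=0)
```

Nodes with higher `scores` are visited more often by the walker and would be ranked as more likely future neighbors of the source; here the stronger edge from node 0 to node 1 gives node 1 a higher score than node 2.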
PADDLE: Proximal Algorithm for Dual Dictionaries LEarning
Basso, Curzio, Santoro, Matteo, Verri, Alessandro, Villa, Silvia
The representation of a signal as the superposition of elementary signals, or atoms, is the pillar of a number of research fields and analysis techniques. The best-known example of such methods is the Fourier transform, where the atoms form an orthonormal basis and every signal has a unique representation. Although an orthonormal basis would seem the most natural choice for decomposing a signal, overcomplete dictionaries (or frames) are nowadays commonplace and their use is both theoretically justified and supported by experimentally successful applications [1]. Tight frames are a class of overcomplete dictionaries with the particular property of ensuring that the optimal representation can still be recovered, as with orthonormal bases, by means of inner products of the signal and the dictionary. The goal of this paper is to introduce an algorithm, which we call PADDLE, capable of learning from data a dictionary endowed with properties similar to those of tight frames.
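The tight-frame property mentioned above can be checked numerically on a small example. The sketch below is not part of PADDLE; it uses a textbook frame of three unit vectors in the plane spaced 120 degrees apart, and the variable names are illustrative.

```python
import numpy as np

# Three unit vectors in R^2 spaced 120 degrees apart (rows of D).
angles = np.deg2rad([90.0, 210.0, 330.0])
D = np.column_stack([np.cos(angles), np.sin(angles)])  # shape (3, 2)

# Tight-frame property: D^T D = A * I, here with frame bound A = 3/2.
frame_bound = 1.5
is_tight = np.allclose(D.T @ D, frame_bound * np.eye(2))

x = np.array([0.7, -1.2])
coeffs = D @ x                          # analysis: inner products with the atoms
x_rec = (D.T @ coeffs) / frame_bound    # synthesis: rescaled sum of atoms
```

Although the dictionary has more atoms than dimensions, `x_rec` matches `x` using only inner products, just as it would for an orthonormal basis.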
Characterization of differentially expressed genes using high-dimensional co-expression networks
de Abreu, Gabriel C. G., Labouriau, Rodrigo
We present a technique to characterize differentially expressed genes in terms of their position in a high-dimensional co-expression network. The set-up of Gaussian graphical models is used to construct representations of the co-expression network in such a way that redundancy and the propagation of spurious information along the network are avoided. The proposed inference procedure is based on the minimization of the Bayesian Information Criterion (BIC) in the class of decomposable graphical models. This class of models can be used to represent complex relationships and has suitable properties that allow effective inference in problems with a high degree of complexity (e.g. several thousands of genes) and a small number of observations (e.g. 10-100), as typically occur in high-throughput gene expression studies. Taking advantage of the internal structure of decomposable graphical models, we construct a compact representation of the co-expression network that allows us to identify the regions with a high concentration of differentially expressed genes. It is argued that differentially expressed genes located in highly interconnected regions of the co-expression network are less informative than differentially expressed genes located in less interconnected regions. Based on that idea, a measure of uncertainty that resembles the notion of relative entropy is proposed. Our methods are illustrated with three publicly available data sets on microarray experiments (the largest involving more than 50,000 genes and 64 patients) and a short simulation study.
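For reference, the criterion being minimized has the standard form below. The notation is an assumption for illustration and need not match the paper's: $\hat{\ell}(G)$ is the maximized log-likelihood of the graphical model $G$, $k_G$ is its number of free parameters, and $n$ is the number of observations.

```latex
\mathrm{BIC}(G) = -2\,\hat{\ell}(G) + k_G \log n
```

The penalty term $k_G \log n$ grows with the number of parameters, so minimizing BIC favors sparser graphs when observations are few.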
Learning Planar Ising Models
Johnson, Jason K., Netrapalli, Praneeth, Chertkov, Michael
Inference and learning of graphical models are both well-studied problems in statistics and machine learning that have found many applications in science and engineering. However, exact inference is intractable in general graphical models, which suggests the problem of seeking the best approximation to a collection of random variables within some tractable family of graphical models. In this paper, we focus our attention on the class of planar Ising models, for which inference is tractable using techniques of statistical physics [Kac and Ward; Kasteleyn]. Based on these techniques and recent methods for planarity testing and planar embedding [Chrobak and Payne], we propose a simple greedy algorithm for learning the best planar Ising model to approximate an arbitrary collection of binary random variables (possibly from sample data). Given the set of all pairwise correlations among variables, we select a planar graph and optimal planar Ising model defined on this graph to best approximate that set of correlations. We demonstrate our method in some simulations and for the application of modeling senate voting records.
Brain covariance selection: better individual functional connectivity models using population prior
Varoquaux, Gaël, Gramfort, Alexandre, Poline, Jean Baptiste, Thirion, Bertrand
Spontaneous brain activity, as observed in functional neuroimaging, has been shown to display reproducible structure that expresses brain architecture and carries markers of brain pathologies. An important view of modern neuroscience is that such large-scale structure of coherent activity reflects modularity properties of brain connectivity graphs. However, to date, there has been no demonstration that the limited and noisy data available in spontaneous activity observations could be used to learn full-brain probabilistic models that generalize to new data. Learning such models entails two main challenges: i) modeling full brain connectivity is a difficult estimation problem that faces the curse of dimensionality and ii) variability between subjects, coupled with the variability of functional signals between experimental runs, makes the use of multiple datasets challenging. We describe subject-level brain functional connectivity structure as a multivariate Gaussian process and introduce a new strategy to estimate it from group data, by imposing a common structure on the graphical model in the population. We show that individual models learned from functional Magnetic Resonance Imaging (fMRI) data using this population prior generalize better to unseen data than models based on alternative regularization schemes. To our knowledge, this is the first report of a cross-validated model of spontaneous brain activity. Finally, we use the estimated graphical model to explore the large-scale characteristics of functional architecture and show for the first time that known cognitive networks appear as the integrated communities of the functional connectivity graph.
Balanced Reduction of Nonlinear Control Systems in Reproducing Kernel Hilbert Space
Bouvrie, Jake, Hamzi, Boumediene
We introduce a novel data-driven order reduction method for nonlinear control systems, drawing on recent progress in machine learning and statistical dimensionality reduction. The method rests on the assumption that the nonlinear system behaves linearly when lifted into a high (or infinite) dimensional feature space where balanced truncation may be carried out implicitly. This leads to a nonlinear reduction map which can be combined with a representation of the system belonging to a reproducing kernel Hilbert space to give a closed, reduced order dynamical system which captures the essential input-output characteristics of the original model. Empirical simulations illustrating the approach are also provided.
Stability of Density-Based Clustering
Rinaldo, Alessandro, Singh, Aarti, Nugent, Rebecca, Wasserman, Larry
High density clusters can be characterized by the connected components of a level set $L(\lambda) = \{x:\ p(x)>\lambda\}$ of the underlying probability density function $p$ generating the data, at some appropriate level $\lambda\geq 0$. The complete hierarchical clustering can be characterized by a cluster tree ${\cal T}= \bigcup_{\lambda} L(\lambda)$. In this paper, we study the behavior of a density level set estimate $\widehat L(\lambda)$ and cluster tree estimate $\widehat{\cal{T}}$ based on a kernel density estimator with kernel bandwidth $h$. We define two notions of instability to measure the variability of $\widehat L(\lambda)$ and $\widehat{\cal{T}}$ as a function of $h$, and investigate the theoretical properties of these instability measures.
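The dependence of the estimated level set on the bandwidth $h$ can be seen on a toy one-dimensional example. The sketch below is only an illustration under stated assumptions: it uses a Gaussian kernel, approximates connected components by runs of grid points, and counts them; it does not implement the instability measures defined in the paper, and the data, grid, and function names are hypothetical.

```python
import numpy as np

def kde(grid, data, h):
    """Gaussian kernel density estimate evaluated on `grid` with bandwidth h."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def count_level_set_components(grid, data, h, lam):
    """Count maximal runs of grid points where the estimated density exceeds lam."""
    mask = kde(grid, data, h) > lam
    starts = mask[1:] & ~mask[:-1]  # False -> True transitions open a new run
    return int(mask[0]) + int(starts.sum())

# Two well-separated clumps of points on the real line.
data = np.concatenate([np.linspace(-3.5, -2.5, 20), np.linspace(2.5, 3.5, 20)])
grid = np.linspace(-8.0, 8.0, 801)

n_small_h = count_level_set_components(grid, data, h=0.3, lam=0.05)
n_large_h = count_level_set_components(grid, data, h=3.0, lam=0.05)
```

With the small bandwidth the level set at $\lambda = 0.05$ splits into two components, one per clump, while the large bandwidth oversmooths the estimate and merges them into one.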
Exact block-wise optimization in group lasso and sparse group lasso for linear regression
The group lasso is a penalized regression method, used in regression problems where the covariates are partitioned into groups to promote sparsity at the group level. Existing methods for finding the group lasso estimator either use gradient projection methods to update the entire coefficient vector simultaneously at each step, or update one group of coefficients at a time using an inexact line search to approximate the optimal value for the group of coefficients when all other groups' coefficients are fixed. We present a new method of computation for the group lasso in the linear regression case, the Single Line Search (SLS) algorithm, which operates by computing the exact optimal value for each group (when all other coefficients are fixed) with one univariate line search. We perform simulations demonstrating that the SLS algorithm is often more efficient than existing computational methods. We also extend the SLS algorithm to the sparse group lasso problem via the Signed Single Line Search (SSLS) algorithm, and give theoretical results to support both algorithms.
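For orientation, a common formulation of the group lasso and of the block-wise subproblem it induces is given below. The $\tfrac{1}{2}$ scaling of the loss and the $\sqrt{p_g}$ group weights are assumed conventions and may differ from those used in the paper; $X_g$ and $\beta_g$ denote the columns and coefficients of group $g$, and $p_g$ is the group size.

```latex
\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2}\Big\| y - \sum_{g=1}^{G} X_g \beta_g \Big\|_2^2
              + \lambda \sum_{g=1}^{G} \sqrt{p_g}\,\|\beta_g\|_2

% Subproblem for group g, with partial residual r_g = y - \sum_{j \neq g} X_j \beta_j:
\hat{\beta}_g = \arg\min_{\beta_g}\; \frac{1}{2}\| r_g - X_g \beta_g \|_2^2
                + \lambda \sqrt{p_g}\,\|\beta_g\|_2,
\qquad
\hat{\beta}_g = 0 \iff \| X_g^{\top} r_g \|_2 \le \lambda \sqrt{p_g}
```

The zero condition follows from the subgradient of the Euclidean norm at the origin. When it fails, the subproblem has a nonzero solution, which is the quantity a block-wise method must compute for each group in turn.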
Transposable regularized covariance models with an application to missing data imputation
Allen, Genevera I., Tibshirani, Robert
Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so-called transposable regularized covariance models allow for maximum likelihood estimation of the mean and nonsingular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.