Statistical Learning
A Distributed Frank-Wolfe Algorithm for Communication-Efficient Sparse Learning
Bellet, Aurélien, Liang, Yingyu, Garakani, Alireza Bagheri, Balcan, Maria-Florina, Sha, Fei
Learning sparse combinations is a frequent theme in machine learning. In this paper, we study its associated optimization problem in the distributed setting where the elements to be combined are not centrally located but spread over a network. We address the key challenges of balancing communication costs and optimization errors. To this end, we propose a distributed Frank-Wolfe (dFW) algorithm. We obtain theoretical guarantees on the optimization error $\epsilon$ and communication cost that do not depend on the total number of combining elements. We further show that the communication cost of dFW is optimal by deriving a lower-bound on the communication cost required to construct an $\epsilon$-approximate solution. We validate our theoretical analysis with empirical studies on synthetic and real-world data, which demonstrate that dFW outperforms both baselines and competing methods. We also study the performance of dFW when the conditions of our analysis are relaxed, and show that dFW is fairly robust.
SPRITE: A Response Model For Multiple Choice Testing
Ning, Ryan, Waters, Andrew E., Studer, Christoph, Baraniuk, Richard G.
Item response theory (IRT) models for categorical response data are widely used in the analysis of educational data, computerized adaptive testing, and psychological surveys. However, most IRT models rely on both the assumption that categories are strictly ordered and the assumption that this ordering is known a priori. These assumptions are impractical in many real-world scenarios, such as multiple-choice exams where the levels of incorrectness for the distractor categories are often unknown. While a number of results exist on IRT models for unordered categorical data, they tend to have restrictive modeling assumptions that lead to poor data fitting performance in practice. Furthermore, existing unordered categorical models have parameters that are difficult to interpret. In this work, we propose a novel methodology for unordered categorical IRT that we call SPRITE (short for stochastic polytomous response item model) that: (i) analyzes both ordered and unordered categories, (ii) offers interpretable outputs, and (iii) provides improved data fitting compared to existing models. We compare SPRITE to existing item response models and demonstrate its efficacy on both synthetic and real-world educational datasets.
Entropic one-class classifiers
Livi, Lorenzo, Sadeghian, Alireza, Pedrycz, Witold
The one-class classification problem is a well-known research endeavor in pattern recognition. The problem is also known under different names, such as outlier and novelty/anomaly detection. The core of the problem consists in modeling and recognizing patterns belonging only to a so-called target class. All other patterns are termed non-target, and therefore they should be recognized as such. In this paper, we propose a novel one-class classification system that is based on an interplay of different techniques. Primarily, we follow a dissimilarity representation based approach; we embed the input data into the dissimilarity space by means of an appropriate parametric dissimilarity measure. This step allows us to process virtually any type of data. The dissimilarity vectors are then represented through a weighted Euclidean graphs, which we use to (i) determine the entropy of the data distribution in the dissimilarity space, and at the same time (ii) derive effective decision regions that are modeled as clusters of vertices. Since the dissimilarity measure for the input data is parametric, we optimize its parameters by means of a global optimization scheme, which considers both mesoscopic and structural characteristics of the data represented through the graphs. The proposed one-class classifier is designed to provide both hard (Boolean) and soft decisions about the recognition of test patterns, allowing an accurate description of the classification process. We evaluate the performance of the system on different benchmarking datasets, containing either feature-based or structured patterns. Experimental results demonstrate the effectiveness of the proposed technique.
Learning the Conditional Independence Structure of Stationary Time Series: A Multitask Learning Approach
E consider a stationary discrete-time vector process or time series. Such a process could model, e.g., the time evolution of air pollutant concentrations [1], [2] or medical diagnostic data obtained in electrocorticography (ECoG) [3]. One specific way of representing the dependence structure of a vector process is via a graphical model [4], where the nodes of the graph represent the individual scalar process components, and the edges represent statistical relations between the individual process components. More precisely, the (undirected) edges of a conditional independence graph (CIG) associated with a process represent conditional independence statements about the process components [4], [1]. In particular, two nodes in the CIG are connected by an edge if and only if the two corresponding process components are conditionally dependent, given the remaining process components. Note that the so defined CIG for time series extends the basic notion of a CIG for random vectors by considering dependencies between entire time series instead of dependencies between scalar random variables [5], [6]. In this work, we investigate the problem of graphical model selection (GMS), i.e., that of inferring the CIG of a time series, given a finite-length observation. A. Jung is with the Institute of Telecommunications, Vienna University of Technology, 1040-Vienna, Austria email: ajung@nt.tuwien.ac.at.
Techniques for clustering interaction data as a collection of graphs
Lee, Nam H., Priebe, Carey, Park, Youngser, Wang, I-Jeng, Rosen, Michael
A natural approach to analyze interaction data of form "what-connects-to-what-when" is to create a time-series (or rather a sequence) of graphs through temporal discretization (bandwidth selection) and spatial discretization (vertex contraction). Such discretization together with non-negative factorization techniques can be useful for obtaining clustering of graphs. Motivating application of performing clustering of graphs (as opposed to vertex clustering) can be found in neuroscience and in social network analysis, and it can also be used to enhance community detection (i.e., vertex clustering) by way of conditioning on the cluster labels. In this paper, we formulate a problem of clustering of graphs as a model selection problem. Our approach involves information criteria, non-negative matrix factorization and singular value thresholding, and we illustrate our techniques using real and simulated data.
Survey schemes for stochastic gradient descent with applications to M-estimation
Clémençon, Stéphan, Bertail, Patrice, Chautru, Emilie, Papa, Guillaume
In certain situations that shall be undoubtedly more and more common in the Big Data era, the datasets available are so massive that computing statistics over the full sample is hardly feasible, if not unfeasible. A natural approach in this context consists in using survey schemes and substituting the "full data" statistics with their counterparts based on the resulting random samples, of manageable size. It is the main purpose of this paper to investigate the impact of survey sampling with unequal inclusion probabilities on stochastic gradient descent-based M-estimation methods in large-scale statistical and machine-learning problems. Precisely, we prove that, in presence of some a priori information, one may significantly increase asymptotic accuracy when choosing appropriate first order inclusion probabilities, without affecting complexity. These striking results are described here by limit theorems and are also illustrated by numerical experiments.
Exploring Sparsity in Multi-class Linear Discriminant Analysis
Recent studies in the literature have paid much attention to the sparsity in linear classification tasks. One motivation of imposing sparsity assumption on the linear discriminant direction is to rule out the noninformative features, making hardly contribution to the classification problem. Most of those work were focused on the scenarios of binary classification, such as Fan et al. (2012), Cai and Liu (2011) and Mai et al. (2012). In the presence of multi-class data, preceding researches recommended individually pairwise sparse linear discriminant analysis(LDA), such as Cai and Liu(2011),Fan et al.(2012). However, further sparsity should be explored. In this paper, an estimator of grouped LASSO type is proposed to take advantage of sparsity for multi-class data. It enjoys appealing non-asymptotic properties which allows insignificant correlations among features. This estimator exhibits superior capability on both simulated and real data.
Co-clustering for directed graphs: the Stochastic co-Blockmodel and spectral algorithm Di-Sim
Directed graphs have asymmetric connections, yet the current graph clustering methodologies cannot identify the potentially global structure of these asymmetries. We give a spectral algorithm called di-sim that builds on a dual measure of similarity that correspond to how a node (i) sends and (ii) receives edges. Using di-sim, we analyze the global asymmetries in the networks of Enron emails, political blogs, and the c elegans neural connectome. In each example, a small subset of nodes have persistent asymmetries; these nodes send edges with one cluster, but receive edges with another cluster. Previous approaches would have assigned these asymmetric nodes to only one cluster, failing to identify their sending/receiving asymmetries. Regularization and "projection" are two steps of di-sim that are essential for spectral clustering algorithms to work in practice. The theoretical results show that these steps make the algorithm weakly consistent under the degree corrected Stochastic co-Blockmodel, a model that generalizes the Stochastic Blockmodel to allow for both (i) degree heterogeneity and (ii) the global asymmetries that we intend to detect. The theoretical results make no assumptions on the smallest degree nodes. Instead, the theorem requires that the average degree grows sufficiently fast and that the weak consistency only applies to the subset of the nodes with sufficiently large leverage scores. The results results also apply to bipartite graphs.
An Introduction to Matrix Concentration Inequalities
In recent years, random matrices have come to play a major role in computational mathematics, but most of the classical areas of random matrix theory remain the province of experts. Over the last decade, with the advent of matrix concentration inequalities, research has advanced to the point where we can conquer many (formerly) challenging problems with a page or two of arithmetic. My aim is to describe the most successful methods from this area along with some interesting examples that these techniques can illuminate. I hope that the results in these pages will inspire future work on applications of random matrices as well as refinements of the matrix concentration inequalities discussed herein. I have chosen to present a coherent body of results based on a generalization of the Laplace transform method for establishing scalar concentration inequalities. In the last two years, Lester Mackey and I, together with our coauthors, have developed an alternative approach to matrix concentration using exchangeable pairs and Markov chain couplings. With some regret, I have chosen to omit this theory because the ideas seem less accessible to a broad audience of researchers. The interested reader will find pointers to these articles in the annotated bibliography. The work described in these notes reflects the influence of many researchers.
An Effective Semi-supervised Divisive Clustering Algorithm
Diverse experimental data ranging from microarray gene expression data in biology to spectrum data in astronomy require to be clustered to signal meaningful correlation of the data. Massive documents or images on internet are also needed to be effectively organized so as to promote the efficiency of search engines. Clustering method as K-means (1) is popular for its simplicity, yet sensitive to noise and initialization and thus is limited by the lack of reliability. Hierarchical clustering (HC) (2) is simple and intuitive and thus widely used especially in biology (3), whereas it needs a large computation (4) and its result is variable to a set of similarity measures between clusters. Moreover, the cluster number for the above methods needs to be prespecified (e.g., K-means) or determined by a threshold (e.g., HC). Some other well-known algorithms either involve complex optimization and postprocessing (5), or have limited range of applications such as the distribution (6) or the attribute of data (7, 8). Although affinity propagation (AP) (9) has much better performance than K-means and the cluster number is determined automatically, it is not good at detecting nonspherical clusters (10). Recently, two effective clustering algorithms (10, 11) were proposed, which can together form a pool of clustering methods based on the in-tree structure (11). But they involve a free parameter.