Statistical Learning
Constructing a Non-Negative Low Rank and Sparse Graph with Data-Adaptive Features
Zhuang, Liansheng, Gao, Shenghua, Tang, Jinhui, Wang, Jingjing, Lin, Zhouchen, Ma, Yi
This paper aims at constructing a good graph for discovering intrinsic data structures in a semi-supervised learning setting. Firstly, we propose to build a non-negative low-rank and sparse (referred to as NNLRS) graph for the given data representation. Specifically, the weights of edges in the graph are obtained by seeking a nonnegative low-rank and sparse matrix that represents each data sample as a linear combination of others. The so-obtained NNLRS-graph can capture both the global mixture of subspaces structure (by the low rankness) and the locally linear structure (by the sparseness) of the data, hence is both generative and discriminative. Secondly, as good features are extremely important for constructing a good graph, we propose to learn the data embedding matrix and construct the graph jointly within one framework, which is termed as NNLRS with embedded features (referred to as NNLRS-EF). Extensive experiments on three publicly available datasets demonstrate that the proposed method outperforms the state-of-the-art graph construction method by a large margin for both semi-supervised classification and discriminative analysis, which verifies the effectiveness of our proposed method.
Structured Low-Rank Matrix Factorization with Missing and Grossly Corrupted Observations
Shang, Fanhua, Liu, Yuanyuan, Tong, Hanghang, Cheng, James, Cheng, Hong
Recovering low-rank and sparse matrices from incomplete or corrupted observations is an important problem in machine learning, statistics, bioinformatics, computer vision, as well as signal and image processing. In theory, this problem can be solved by the natural convex joint/mixed relaxations (i.e., l_{1}-norm and trace norm) under certain conditions. However, all current provable algorithms suffer from superlinear per-iteration cost, which severely limits their applicability to large-scale problems. In this paper, we propose a scalable, provable structured low-rank matrix factorization method to recover low-rank and sparse matrices from missing and grossly corrupted data, i.e., robust matrix completion (RMC) problems, or incomplete and grossly corrupted measurements, i.e., compressive principal component pursuit (CPCP) problems. Specifically, we first present two small-scale matrix trace norm regularized bilinear structured factorization models for RMC and CPCP problems, in which repetitively calculating SVD of a large-scale matrix is replaced by updating two much smaller factor matrices. Then, we apply the alternating direction method of multipliers (ADMM) to efficiently solve the RMC problems. Finally, we provide the convergence analysis of our algorithm, and extend it to address general CPCP problems. Experimental results verified both the efficiency and effectiveness of our method compared with the state-of-the-art methods.
Crowd Labeling: a survey
Muhammadi, Jafar, Rabiee, Hamid Reza, Hosseini, Abbas
Recently, there has been a burst in the number of research projects on human computation via crowdsourcing. Multiple choice (or labeling) questions could be referred to as a common type of problem which is solved by this approach. As an application, crowd labeling is applied to find true labels for large machine learning datasets. Since crowds are not necessarily experts, the labels they provide are rather noisy and erroneous. This challenge is usually resolved by collecting multiple labels for each sample, and then aggregating them to estimate the true label. Although the mechanism leads to high-quality labels, it is not actually cost-effective. As a result, efforts are currently made to maximize the accuracy in estimating true labels, while fixing the number of acquired labels. This paper surveys methods to aggregate redundant crowd labels in order to estimate unknown true labels. It presents a unified statistical latent model where the differences among popular methods in the field correspond to different choices for the parameters of the model. Afterwards, algorithms to make inference on these models will be surveyed. Moreover, adaptive methods which iteratively collect labels based on the previously collected labels and estimated models will be discussed. In addition, this paper compares the distinguished methods, and provides guidelines for future work required to address the current open issues.
Axiomatic Construction of Hierarchical Clustering in Asymmetric Networks
Carlsson, Gunnar, Mรฉmoli, Facundo, Ribeiro, Alejandro, Segarra, Santiago
This paper considers networks where relationships between nodes are represented by directed dissimilarities. The goal is to study methods for the determination of hierarchical clusters, i.e., a family of nested partitions indexed by a connectivity parameter, induced by the given dissimilarity structures. Our construction of hierarchical clustering methods is based on defining admissible methods to be those methods that abide by the axioms of value - nodes in a network with two nodes are clustered together at the maximum of the two dissimilarities between them - and transformation - when dissimilarities are reduced, the network may become more clustered but not less. Several admissible methods are constructed and two particular methods, termed reciprocal and nonreciprocal clustering, are shown to provide upper and lower bounds in the space of admissible methods. Alternative clustering methodologies and axioms are further considered. Allowing the outcome of hierarchical clustering to be asymmetric, so that it matches the asymmetry of the original data, leads to the inception of quasi-clustering methods. The existence of a unique quasi-clustering method is shown. Allowing clustering in a two-node network to proceed at the minimum of the two dissimilarities generates an alternative axiomatic construction. There is a unique clustering method in this case too. The paper also develops algorithms for the computation of hierarchical clusters using matrix powers on a min-max dioid algebra and studies the stability of the methods proposed. We proved that most of the methods introduced in this paper are such that similar networks yield similar hierarchical clustering results. Algorithms are exemplified through their application to networks describing internal migration within states of the United States (U.S.) and the interrelation between sectors of the U.S. economy.
Breakdown Point of Robust Support Vector Machine
Kanamori, Takafumi, Fujiwara, Shuhei, Takeda, Akiko
The support vector machine (SVM) is one of the most successful learning methods for solving classification problems. Despite its popularity, SVM has a serious drawback, that is sensitivity to outliers in training samples. The penalty on misclassification is defined by a convex loss called the hinge loss, and the unboundedness of the convex loss causes the sensitivity to outliers. To deal with outliers, robust variants of SVM have been proposed, such as the robust outlier detection algorithm and an SVM with a bounded loss called the ramp loss. In this paper, we propose a robust variant of SVM and investigate its robustness in terms of the breakdown point. The breakdown point is a robustness measure that is the largest amount of contamination such that the estimated classifier still gives information about the non-contaminated data. The main contribution of this paper is to show an exact evaluation of the breakdown point for the robust SVM. For learning parameters such as the regularization parameter in our algorithm, we derive a simple formula that guarantees the robustness of the classifier. When the learning parameters are determined with a grid search using cross validation, our formula works to reduce the number of candidate search points. The robustness of the proposed method is confirmed in numerical experiments. We show that the statistical properties of the robust SVM are well explained by a theoretical analysis of the breakdown point.
Measures of Entropy from Data Using Infinitely Divisible Kernels
Giraldo, Luis G. Sanchez, Rao, Murali, Principe, Jose C.
Information theory provides principled ways to analyze different inference and learning problems such as hypothesis testing, clustering, dimensionality reduction, classification, among others. However, the use of information theoretic quantities as test statistics, that is, as quantities obtained from empirical data, poses a challenging estimation problem that often leads to strong simplifications such as Gaussian models, or the use of plug in density estimators that are restricted to certain representation of the data. In this paper, a framework to non-parametrically obtain measures of entropy directly from data using operators in reproducing kernel Hilbert spaces defined by infinitely divisible kernels is presented. The entropy functionals, which bear resemblance with quantum entropies, are defined on positive definite matrices and satisfy similar axioms to those of Renyi's definition of entropy. Convergence of the proposed estimators follows from concentration results on the difference between the ordered spectrum of the Gram matrices and the integral operators associated to the population quantities. In this way, capitalizing on both the axiomatic definition of entropy and on the representation power of positive definite kernels, the proposed measure of entropy avoids the estimation of the probability distribution underlying the data. Moreover, estimators of kernel-based conditional entropy and mutual information are also defined. Numerical experiments on independence tests compare favourably with state of the art.
Multi-task Sparse Structure Learning
Goncalves, Andre R., Das, Puja, Chatterjee, Soumyadeep, Sivakumar, Vidyashankar, Von Zuben, Fernando J., Banerjee, Arindam
Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously. While sometimes the underlying task relationship structure is known, often the structure needs to be estimated from data at hand. In this paper, we present a novel family of models for MTL, applicable to regression and classification problems, capable of learning the structure of task relationships. In particular, we consider a joint estimation problem of the task relationship structure and the individual task parameters, which is solved using alternating minimization. The task relationship structure learning component builds on recent advances in structure learning of Gaussian graphical models based on sparse estimators of the precision (inverse covariance) matrix. We illustrate the effectiveness of the proposed model on a variety of synthetic and benchmark datasets for regression and classification. We also consider the problem of combining climate model outputs for better projections of future climate, with focus on temperature in South America, and show that the proposed model outperforms several existing methods for the problem.
A Plug&Play P300 BCI Using Information Geometry
Barachant, Alexandre, Congedo, Marco
Abstract--This paper presents a new classification methods for Event Related Potentials (ERP) based on an Information geometry framework. Through a new estimation of covariance matrices, this work extend the use of Riemannian geometry, which was previously limited to SMR-based BCI, to the problem of classification of ERPs. As compared to the state-of-the-art, this new method increases performance, reduces the number of data needed for the calibration and features good generalisation across sessions and subjects. This method is illustrated on data recorded with the P300-based game brain invaders. Finally, an online and adaptive implementation is described, where the BCI is initialized with generic parameters derived from a database and continiously adapt to the individual, allowing the user to play the game without any calibration while keeping a high accuracy. So far we have conceived a Brain-Computer Interface (BCI) as a learning machine where the classifier is trained in a calibration phase preceding immediately the actual BCI use [1]. Depending on the BCI paradigm and on the efficiency of the classifier, the calibration phase may last from a few to several minutes. Regardless the duration, the very necessity of a calibration session reduces drastically the usability and appealing of a BCI. This is true both for clinically-oriented BCI, where the cognitive skills of patients are often limited and are wasted in the calibration phase, and for healthy users where the plug&play operation is nowadays considered as a minimum requirement for any consumer interfaces and devices. Besides the essential considerations from the user perspective, it appears evident that training the BCI at the beginning of each session and discarding the calibration data at the end is a very inefficient way to proceed. The problem we pose here is: can we design a "plug&play" BCI? Of course, such a goal does not imply that the BCI is not calibrated.
Sensing Subjective Well-being from Social Media
Hao, Bibo, Li, Lin, Gao, Rui, Li, Ang, Zhu, Tingshao
Subjective Well-being(SWB), which refers to how people experience the quality of their lives, is of great use to public policy-makers as well as economic, sociological research, etc. Traditionally, the measurement of SWB relies on time-consuming and costly self-report questionnaires. Nowadays, people are motivated to share their experiences and feelings on social media, so we propose to sense SWB from the vast user generated data on social media. By utilizing 1785 users' social media data with SWB labels, we train machine learning models that are able to "sense" individual SWB from users' social media. Our model, which attains the state-by-art prediction accuracy, can then be used to identify SWB of large population of social media users in time with very low cost.
Falsifiable implies Learnable
To what extent are theory-based predictions justified by prior observations? The question is known as the problem of induction and is fundamental to scientific inference. We address the problem of induction from the perspective of learning theory. That is, we consider which theories, and under what assumptions, can be applied to make optimal predictions. Our main result is that the more hypotheses a theory falsifies, suitably quantified, the closer the predictive performance of the best strategy (based on the theory) will be to the theory's post hoc explanatory performance on observed data.