Statistical Learning
A Probabilistic Approach to Latent Cluster Analysis
Xie, Zhipeng (Fudan University) | Dong, Rui (Fudan University) | Deng, Zhengheng (Fudan University) | He, Zhenying (Fudan University) | Yang, Weidong (Fudan University)
Facing a large number of clustering solutions, cluster ensemble method provides an effective approach to aggregating them into a better one. In this paper, we propose a novel cluster ensemble method from probabilistic perspective. It assumes that each clustering solution is generated from a latent cluster model, under the control of two probabilistic parameters. Thus, the cluster ensemble problem is reformulated into an optimization problem of maximum likelihood. An EM-style algorithm is designed to solve this problem. It can determine the number of clusters automatically. Experimenal results have shown that the proposed algorithm outperforms the state-of-the-art methods including EAC-AL, CSPA, HGPA, and MCLA. Furthermore, it has been shown that our algorithm is stable in the predicted numbers of clusters.
A Theoretic Framework of K-Means-Based Consensus Clustering
Wu, Junjie (Beihang University) | Liu, Hongfu (Beihang University) | Xiong, Hui (Rutgers University) | Cao, Jie ( Nanjing University of Finance and Economics )
Consensus clustering emerges as a promising solution to find cluster structures from data. As an efficient approach for consensus clustering, the K-means based method has garnered attention in the literature, but the existing research is still preliminary and fragmented. In this paper, we provide a systematic study on the framework of K-means-based Consensus Clustering (KCC). We first formulate the general definition of KCC, and then reveal a necessary and sufficient condition for utility functions that work for KCC, on both complete and incomplete basic partitionings. Experimental results on various real-world data sets demonstrate that KCC is highly efficient and is comparable to the state-of-the-art methods in terms of clustering quality. In addition, KCC shows high robustness to incomplete basic partitionings with substantial missing values.
A KNN Based Kalman Filter Gaussian Process Regression
Wang, Yali (Laval University) | Chaib-draa, Brahim (Laval University)
The standard Gaussian process (GP) regression is often intractable when a data set is large or spatially nonstationary. In this paper, we address these challenging data properties by designing a novel K nearest neighbor based Kalman filter Gaussian process (KNN-KFGP) regression. Based on a state space model established by the KNN driven data grouping, our KNN-KFGP recursively filters out the latent function values in a computationally efficient and accurate Kalman filtering framework. Moreover, KNN allows each test point to find its strongly correlated local training subset, so our KNN-KFGP provides a suitable way to deal with spatial nonstationary problems. We evaluate the performance of our KNN-KFGP on several synthetic and real data sets to show its validity.
Large Scale Online Kernel Classification
Hoi, Steven C. H. (Nanyang Technological University) | Wang, Jialei (Nanyang Technological University) | Zhao, Peilin (Nanyang Technological University) | Zhuang, Jinfeng (Microsoft Corporation) | Liu, Zhi-yong (Chinese Academy of Sciences)
In this work, we present a new framework for large scale online kernel classification, making kernel methods efficient and scalable for large-scale online learning tasks. Unlike the regular budget kernel online learning scheme that usually uses different strategies to bound the number of support vectors, our framework explores a functional approximation approach to approximating a kernel function/matrix in order to make the subsequent online learning task efficient and scalable. Specifically, we present two different online kernel machine learning algorithms: (i) the Fourier Online Gradient Descent (FOGD) algorithm that applies the random Fourier features for approximating kernel functions; and (ii) the Nystrom Online Gradient Descent (NOGD) algorithm that applies the Nystrom method to approximate large kernel matrices. We offer theoretical analysis of the proposed algorithms, and conduct experiments for large-scale online classification tasks with some data set of over 1 million instances. Our encouraging results validate the effectiveness and efficiency of the proposed algorithms, making them potentially more practical than the family of existing budget kernel online learning approaches.
Multi-View Maximum Entropy Discrimination
Sun, Shiliang (East China Normal University) | Chao, Guoqing (East China Normal University)
Maximum entropy discrimination (MED) is a general framework for discriminative estimation based on the well known maximum entropy principle, which embodies the Bayesian integration of prior information with large margin constraints on observations. It is a successful combination of maximum entropy learning and maximum margin learning, and can subsume support vector machines (SVMs) as a special case. In this paper, we present a multi-view maximum entropy discrimination framework that is an extension of MED to the scenario of learning with multiple feature sets. Different from existing approaches to exploiting multiple views, such as co-training style algorithms and co-regularization style algorithms, we propose a new method to make use of the distinct views where classification margins from these views are required to be identical. We give the general form of the solution to the multi-view maximum entropy discrimination, and provide an instantiation under a specific prior formulation which is analogical to a multi-view version of SVMs. Experimental results on real-world data sets show the effectiveness of the proposed multi-view maximum entropy discrimination approach.
Unlearning from Demonstration
Sullivan, Keith (George Mason University) | ElMolla, Ahmed (George Mason University) | Squires, Bill (George Mason University) | Luke, Sean (George Mason University)
When doing learning from demonstration, it is often the case that the demonstrator provides corrective examples to fix errant behavior by the agent or robot. We present a set of algorithms which use this corrective data to identify and remove noisy examples in datasets which caused errant classifications, and ultimately errant behavior. The objective is to actually modify the source datasets rather than solely rely on the noise-insensitivity of the classification algorithm. This is particularly useful in the sparse datasets often found in learning from demonstration experiments. Our approach tries to distinguish between noisy misclassification and mere undersampling of the learning space. If errors are a result of misclassification, we potentially remove the responsible points and update the classifier. We demonstrate our method on UCI Machine Learning datasets at different levels of sparsity and noise, using decision trees, K-Nearest-Neighbor, and support vector machines.
Hartigan's K-Means Versus Lloyd's K-Means โ Is It Time for a Change?
Slonim, Noam (IBM Haifa Research Lab) | Aharoni, Ehud (IBM Haifa Research Lab) | Crammer, Koby (The Technion Department of Electrical Engineering)
Hartigan's method for k-means clustering holds several potential advantages compared to the classical and prevalent optimization heuristic known as Lloyd's algorithm. E.g., it was recently shown that the set of local minima of Hartigan's algorithm is a subset of those of Lloyd's method. We develop a closed-form expression that allows to establish Hartigan's method for k-means clustering with any Bregman divergence, and further strengthen the case of preferring Hartigan's algorithm over Lloyd's algorithm. Specifically, we characterize a range of problems with various noise levels of the inputs, for which any random partition represents a local minimum for Lloyd's algorithm, while Hartigan's algorithm easily converges to the correct solution. Extensive experiments on synthetic and real-world data further support our theoretical analysis.
Large-Scale Spectral Clustering on Graphs
Liu, Jialu (University of Illinois at Urbana-Champaign) | Wang, Chi (University of Illinois at Urbana-Champaign) | Danilevsky, Marina (University of Illinois at Urbana-Champaign) | Han, Jiawei (University of Illinois at Urbana-Champaign)
Graph clustering has received growing attention in recent years as an important analytical technique, both due to the prevalence of graph data, and the usefulness of graph structures for exploiting intrinsic data characteristics.However, as graph data grows in scale, it becomes increasingly more challenging to identify clusters. In this paper we propose an efficient clustering algorithm for large-scale graph data using spectral methods. The key idea is to repeatedly generate a small number of "supernodes" connected to the regular nodes, in order to compress the original graph into a sparse bipartite graph. By clustering the bipartite graph using spectral methods, we are able to greatly improve efficiency without losing considerable clustering power. Extensive experiments show the effectiveness and efficiency of our approach.
Learning Finite Beta-Liouville Mixture Models via Variational Bayes for Proportional Data Clustering
Fan, Wentao (Concordia University) | Bouguila, Nizar (Concordia University)
During the past decade, finite mixture modeling has become a well-established technique in data analysis and clustering. This paper focus on developing a variational inference framework to learn finite Beta-Liouville mixture models that have been proposed recently as an efficient way for proportional data clustering. In contrast to the conventional expectation maximization (EM) algorithm, commonly used for learning finite mixture models, the proposed algorithm has the advantages that it is more efficient from a computational point of view and by preventing over- and under-fitting problems. Moreover, the complexity of the mixture model (i.e. the number of components) can be determined automatically and simultaneously with the parameters estimation in a closed form as part of the Bayesian inference procedure. The merits of the proposed approach are shown using both artificial data sets and two interesting and challenging real applications namely dynamic textures clustering and facial expression recognition.
Dimensionality Reduction with Generalized Linear Models
Chen, Mo (The Chinese University of Hong Kong) | Li, Wei (The Chinese University of Hong Kong) | Wang, Xiaogang (The Chinese University of Hong Kong) | Zhang, Wei (The Chinese University of Hong Kong)
In this paper, we propose a general dimensionality reduction method for data generated from a very broad family of distributions and nonlinear functions based on the generalized linear model, called Generalized Linear Principal Component Analysis (GLPCA). Data of different domains often have very different structures. These data can be modeled by different distributions and reconstruction functions. For example, real valued data can be modeled by the Gaussian distribution with a linear reconstruction function, whereas binary valued data may be more appropriately modeled by the Bernoulli distribution with a logit or probit function. Based on general linear models, we propose a unified framework for extracting features from data of different domains. A general optimization algorithm based on natural gradient ascent on distribution manifold is proposed for obtaining the maximum likelihood solutions. We also present some specific algorithms derived from this framework to deal with specific data modeling problems such as document modeling. Experimental results of these algorithms on several data sets are shown for the validation of GLPCA.