co-association matrix
Solving weakly supervised regression problem using low-rank manifold regularization
Berikov, Vladimir, Litvinenko, Alexander
We solve a weakly supervised regression problem. By "weakly" we mean that the labels are known for some training points, unknown for others, and uncertain for the rest due to random noise or other reasons such as lack of resources. The solution requires optimizing an objective function (the loss function) that combines manifold regularization and low-rank matrix decomposition techniques. These low-rank approximations allow us to speed up all matrix calculations and reduce storage requirements, which is especially crucial for large datasets. Ensemble clustering is used to obtain the co-association matrix, which we treat as the similarity matrix. Together, these techniques increase the quality and stability of the solution. In the numerical section, we apply the suggested method to artificial and real datasets using Monte Carlo modeling.
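The storage and speed argument above can be illustrated with a minimal sketch. It assumes an equally weighted co-association matrix, which then factors exactly as B B^T with B built from one-hot cluster indicators; the names `indicator_factor` and `coassoc_matvec` are hypothetical and are not taken from the paper.

```python
import numpy as np

def indicator_factor(base_clusterings):
    """Return B (n x total number of clusters) with B @ B.T equal to the
    equally weighted co-association matrix (an assumption of this sketch)."""
    blocks = []
    for labels in base_clusterings:
        labels = np.asarray(labels)
        # Recode arbitrary labels as 0..k-1, then one-hot encode them.
        _, codes = np.unique(labels, return_inverse=True)
        Z = np.zeros((labels.size, codes.max() + 1))
        Z[np.arange(labels.size), codes] = 1.0
        blocks.append(Z)
    return np.hstack(blocks) / np.sqrt(len(base_clusterings))

def coassoc_matvec(B, x):
    """Multiply the co-association matrix by x without forming the n x n matrix."""
    return B @ (B.T @ x)

# Three toy base clusterings of five points.
base = [[0, 0, 1, 1, 2], [0, 1, 1, 1, 0], [2, 2, 2, 0, 0]]
B = indicator_factor(base)
x = np.arange(5, dtype=float)
dense = B @ B.T               # formed here only to check the factorization
fast = coassoc_matvec(B, x)
```

With n points and r clusters in total across the ensemble, the factor needs n × r numbers instead of n × n, and each product costs O(nr) rather than O(n²).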
Clustering Ensemble Meets Low-rank Tensor Approximation
Jia, Yuheng, Liu, Hui, Hou, Junhui, Zhang, Qingfu
This paper explores the problem of clustering ensemble, which aims to combine multiple base clusterings to produce better performance than that of the individual one. The existing clustering ensemble methods generally construct a co-association matrix, which indicates the pairwise similarity between samples, as the weighted linear combination of the connective matrices from different base clusterings, and the resulting co-association matrix is then adopted as the input of an off-the-shelf clustering algorithm, e.g., spectral clustering. However, the co-association matrix may be dominated by poor base clusterings, resulting in inferior performance. In this paper, we propose a novel low-rank tensor approximation-based method to solve the problem from a global perspective. Specifically, by inspecting whether two samples are clustered to an identical cluster under different base clusterings, we derive a coherent-link matrix, which contains limited but highly reliable relationships between samples. We then stack the coherent-link matrix and the co-association matrix to form a three-dimensional tensor, the low-rankness property of which is further explored to propagate the information of the coherent-link matrix to the co-association matrix, producing a refined co-association matrix. We formulate the proposed method as a convex constrained optimization problem and solve it efficiently. Experimental results over 7 benchmark data sets show that the proposed model achieves a breakthrough in clustering performance, compared with 12 state-of-the-art methods. To the best of our knowledge, this is the first work to explore the potential of low-rank tensor on clustering ensemble, which is fundamentally different from previous approaches.
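As a rough illustration of the two matrices described above, the sketch below builds a co-association matrix as a weighted combination of connective matrices. It also builds a coherent-link matrix under the assumption that a link is kept only when two samples share a cluster in every base clustering; the abstract does not spell out this rule, so treat it as one plausible reading. All function names are hypothetical, and the tensor-based refinement itself is not reproduced.

```python
import numpy as np

def connective_matrix(labels):
    """Binary n x n matrix: entry (i, j) is 1 when samples i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def co_association(base_clusterings, weights=None):
    """Weighted linear combination of the connective matrices (equal weights by default)."""
    mats = np.stack([connective_matrix(l) for l in base_clusterings])
    if weights is None:
        weights = np.full(len(mats), 1.0 / len(mats))
    return np.tensordot(np.asarray(weights, dtype=float), mats, axes=1)

def coherent_link(base_clusterings):
    """Assumed rule: keep a link only if the pair is co-clustered in all base clusterings."""
    mats = np.stack([connective_matrix(l) for l in base_clusterings])
    return mats.min(axis=0)

base = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
A = co_association(base)
H = coherent_link(base)
```

In this toy ensemble, samples 0 and 1 are grouped together by every base clustering, so their link survives in `H`, while samples 2 and 3 agree in only two of three clusterings and keep a fractional entry in `A` only.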
Multi-objective Consensus Clustering Framework for Flight Search Recommendation
Chatterjee, Sujoy, Pasquier, Nicolas, Nanty, Simon, Zuluaga, Maria A.
To provide personalized recommendations for travel searches, an appropriate segmentation of customers is required. Clustering ensemble approaches were developed to overcome well-known problems of classical clustering approaches, which each rely on a different theoretical model and can thus identify in the data space only clusters corresponding to this model. Clustering ensemble approaches combine multiple clustering results, each from a different algorithmic configuration, to generate more robust consensus clusters corresponding to agreements between initial clusters. We present a new clustering ensemble multi-objective optimization-based framework developed for analyzing Amadeus customer search data and improving personalized recommendations. This framework optimizes diversity in the clustering ensemble search space and automatically determines an appropriate number of clusters without requiring user input. Experimental results compare the efficiency of this approach with other existing approaches on Amadeus customer search data in terms of internal (Adjusted Rand Index) and external (Amadeus business metric) validations.
Consensus Clustering: An Embedding Perspective, Extension and Beyond
Liu, Hongfu, Tao, Zhiqiang, Ding, Zhengming
Consensus clustering fuses diverse basic partitions (i.e., clustering results obtained from conventional clustering methods) into an integrated one, which has attracted increasing attention in both academic and industrial areas due to its robust and effective performance. Tremendous research efforts have been made to advance this domain in terms of algorithms and applications. Although some survey papers summarize the existing literature, they neglect to explore the underlying connection among different categories. In contrast, in this paper we aim to provide an embedding perspective to illustrate the consensus mechanism, which transfers categorical basic partitions to other representations (e.g., binary coding, spectral embedding, etc.) for the clustering purpose. To this end, we not only unify two major categories of consensus clustering, but also build an intuitive connection between consensus clustering and graph embedding. Moreover, we elaborate several extensions of classical consensus clustering from different settings and problems. Beyond this, we demonstrate how to leverage consensus clustering to address other tasks, such as constrained clustering, domain adaptation, feature selection, and outlier detection. Finally, we conclude this survey with future work in terms of interpretability, learnability and theoretical analysis.
Semi-Supervised Regression using Cluster Ensemble and Low-Rank Co-Association Matrix Decomposition under Uncertainties
Berikov, Vladimir, Litvinenko, Alexander
In this paper, we solve a semi-supervised regression problem. Due to the lack of knowledge about the data structure and the presence of random noise, the considered data model is uncertain. We propose a method which combines graph Laplacian regularization and cluster ensemble methodologies. The co-association matrix of the ensemble is calculated on both labeled and unlabeled data; this matrix is used as a similarity matrix in the regularization framework to derive the predicted outputs. We use the low-rank decomposition of the co-association matrix to significantly speed up calculations and reduce memory usage. Numerical experiments using the Monte Carlo approach demonstrate robustness, efficiency, and scalability of the proposed method.
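One way a low-rank co-association factor could enter graph Laplacian regularization is sketched below. The objective (squared error on labeled points, plus a ridge term `alpha` and a Laplacian term `gamma`), the equal ensemble weights, and the use of the Woodbury identity are assumptions of this sketch, not necessarily the authors' exact formulation. The names `indicator_factor` and `predict_low_rank` are hypothetical.

```python
import numpy as np

def indicator_factor(base_clusterings):
    """B such that B @ B.T is the equally weighted co-association matrix."""
    blocks = []
    for labels in base_clusterings:
        labels = np.asarray(labels)
        _, codes = np.unique(labels, return_inverse=True)
        Z = np.zeros((labels.size, codes.max() + 1))
        Z[np.arange(labels.size), codes] = 1.0
        blocks.append(Z)
    return np.hstack(blocks) / np.sqrt(len(base_clusterings))

def predict_low_rank(B, y, labeled, alpha=1e-3, gamma=1.0):
    """Minimize sum_{i labeled} (f_i - y_i)^2 + alpha*||f||^2 + gamma * f^T L f,
    where L = D - B B^T, by solving (G - gamma * B B^T) f = J y with a
    diagonal G and the Woodbury identity (only an r x r system is solved)."""
    n, r = B.shape
    j = labeled.astype(float)               # diagonal of the label mask J
    d = B @ (B.T @ np.ones(n))              # degrees: row sums of B B^T
    g = j + alpha + gamma * d               # diagonal of G = J + alpha*I + gamma*D
    ginv_rhs = (j * y) / g
    ginv_B = B / g[:, None]
    S = np.eye(r) / gamma - B.T @ ginv_B    # small r x r matrix
    return ginv_rhs + ginv_B @ np.linalg.solve(S, B.T @ ginv_rhs)

base = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 1, 1]]
labeled = np.array([True, False, False, False, False, True])
y = np.where(labeled, [1.0, 0, 0, 0, 0, 5.0], 0.0)   # unlabeled entries are ignored
B = indicator_factor(base)
f = predict_low_rank(B, y, labeled)

# Dense reference solution, for checking the sketch on this small example.
n = len(y)
d = (B @ B.T).sum(axis=1)
M = np.diag(labeled.astype(float)) + 1e-3 * np.eye(n) + (np.diag(d) - B @ B.T)
f_dense = np.linalg.solve(M, labeled * y)
```

Only the r × r matrix `S` is factorized, so the cost is driven by the total number of clusters in the ensemble rather than by the number of points.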
Enhanced Ensemble Clustering via Fast Propagation of Cluster-wise Similarities
Huang, Dong, Wang, Chang-Dong, Peng, Hongxing, Lai, Jianhuang, Kwoh, Chee-Keong
Ensemble clustering has been a popular research topic in data mining and machine learning. Despite its significant progress in recent years, there are still two challenging issues in the current ensemble clustering research. First, most of the existing algorithms tend to investigate the ensemble information at the object-level, yet often lack the ability to explore the rich information at higher levels of granularity. Second, they mostly focus on the direct connections (e.g., direct intersection or pair-wise co-occurrence) in the multiple base clusterings, but generally neglect the multi-scale indirect relationship hidden in them. To address these two issues, this paper presents a novel ensemble clustering approach based on fast propagation of cluster-wise similarities via random walks. We first construct a cluster similarity graph with the base clusters treated as graph nodes and the cluster-wise Jaccard coefficient exploited to compute the initial edge weights. Upon the constructed graph, a transition probability matrix is defined, based on which the random walk process is conducted to propagate the graph structural information. Specifically, by investigating the propagating trajectories starting from different nodes, a new cluster-wise similarity matrix can be derived by considering the trajectory relationship. Then, the newly obtained cluster-wise similarity matrix is mapped from the cluster-level to the object-level to achieve an enhanced co-association (ECA) matrix, which is able to simultaneously capture the object-wise co-occurrence relationship as well as the multi-scale cluster-wise relationship in ensembles. Finally, two novel consensus functions are proposed to obtain the consensus clustering result. Extensive experiments on a variety of real-world datasets have demonstrated the effectiveness and efficiency of our approach.
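The first steps of the pipeline described above (cluster similarity graph, Jaccard edge weights, transition matrix, multi-step random walk) can be sketched as follows. This is a simplified illustration with hypothetical function names. The trajectory-based similarity, the mapping to the ECA matrix, and the two consensus functions are not reproduced, because the abstract does not give their formulas.

```python
import numpy as np

def cluster_members(base_clusterings):
    """Boolean membership matrix: one row per base cluster, one column per object."""
    rows = []
    for labels in base_clusterings:
        labels = np.asarray(labels)
        for c in np.unique(labels):
            rows.append(labels == c)
    return np.array(rows)

def jaccard_graph(members):
    """Edge weights |Ci ∩ Cj| / |Ci ∪ Cj| between base clusters, with no self-loops."""
    M = members.astype(float)
    inter = M @ M.T
    sizes = M.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter
    W = inter / union
    np.fill_diagonal(W, 0.0)
    return W

def transition_matrix(W):
    """Row-normalize the edge weights; isolated nodes keep an all-zero row."""
    row = W.sum(axis=1, keepdims=True)
    return np.divide(W, row, out=np.zeros_like(W), where=row > 0)

def multi_step(P, steps):
    """Transition probabilities after 1, 2, ..., steps steps of the random walk."""
    return [np.linalg.matrix_power(P, t) for t in range(1, steps + 1)]

base = [[0, 0, 1, 1], [0, 0, 0, 1]]
members = cluster_members(base)
W = jaccard_graph(members)
P = transition_matrix(W)
walks = multi_step(P, 3)
```

Clusters from the same base clustering are disjoint, so their Jaccard weight is zero and every edge connects clusters from different base clusterings.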
Subspace Clustering using Ensembles of $K$-Subspaces
Lipor, John, Hong, David, Zhang, Dejiao, Balzano, Laura
In modern computer vision problems such as facial recognition [1] and object tracking [2], researchers have found success applying the union of subspaces (UoS) model, in which data vectors lie near one of several subspaces. Under this model, the goal is to simultaneously identify these underlying subspaces and cluster the points according to their nearest subspace. Algorithms designed to solve this problem fall under the category of subspace clustering, a topic that has received a great deal of attention in recent years [3] due to its efficacy on real-world datasets such as the Extended Yale Face Database B [4] and the MNIST handwritten digit database [5]. One of the earliest approaches to solving the subspace clustering problem involves an iterative method in the spirit of K-means, known as K-subspaces (KSS) [6], [7], [8], which alternates between assigning points to clusters and estimating the subspace basis associated with each cluster. As this algorithm is only guaranteed to converge to a local minimum, in practice one runs many instances of the algorithm and chooses the final clustering as the one that produces the minimum cost. Although its empirical performance is limited, KSS continues to serve as a benchmark for subspace clustering algorithms, in part due to its computational efficiency and simplicity. Therefore, a deeper understanding of this method is an important contribution in the area of subspace clustering and a contribution of this paper. While the KSS cost function and alternating algorithm are perhaps the most natural approach for the subspace clustering problem, it is known that there is a set of initializations of nonzero measure from which the algorithm will converge to a point other than the global minimizer.
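The alternating scheme and the restart strategy described above can be sketched as follows. This is a minimal illustration assuming linear subspaces of a known common dimension, random orthonormal initial bases, and squared residual as the cost. The names `kss` and `kss_restarts` are hypothetical, and the ensemble method of the paper is not reproduced.

```python
import numpy as np

def residuals(X, bases):
    """Squared distance from each point to each subspace (columns of U are orthonormal)."""
    sq_norm = np.sum(X ** 2, axis=1)
    return np.stack([sq_norm - np.sum((X @ U) ** 2, axis=1) for U in bases], axis=1)

def kss(X, k, dim, n_iter=50, seed=0):
    """One run of K-subspaces from a random initialization."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    bases = [np.linalg.qr(rng.standard_normal((D, dim)))[0] for _ in range(k)]
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest subspace.
        new_labels = residuals(X, bases).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: the best-fit basis is given by the top right singular vectors.
        for j in range(k):
            pts = X[labels == j]
            if len(pts) >= dim:
                _, _, Vt = np.linalg.svd(pts, full_matrices=False)
                bases[j] = Vt[:dim].T
    cost = residuals(X, bases)[np.arange(n), labels].sum()
    return labels, cost

def kss_restarts(X, k, dim, n_restarts=10):
    """Run several instances and keep the clustering with the minimum cost."""
    runs = [kss(X, k, dim, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: run[1])

# Toy data: points on two lines through the origin in R^3.
rng = np.random.default_rng(42)
line1 = rng.standard_normal((30, 1)) * np.array([[1.0, 0.0, 0.0]])
line2 = rng.standard_normal((30, 1)) * np.array([[0.0, 1.0, 1.0]]) / np.sqrt(2)
X = np.vstack([line1, line2])
labels, best_cost = kss_restarts(X, k=2, dim=1)
```

Because each run only reaches a local minimum, the restart wrapper simply compares final costs; it gives no guarantee of finding the global minimizer.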
Simultaneous Clustering and Ensemble
Tao, Zhiqiang (Northeastern University) | Liu, Hongfu (Northeastern University) | Fu, Yun (Northeastern University)
Ensemble Clustering (EC) has gained a great deal of attention throughout the fields of data mining and machine learning, since it emerged as an effective and robust clustering framework. Typically, EC methods try to fuse multiple basic partitions (BPs) into a consensus one, where each BP is obtained by performing a traditional clustering method on the same dataset. One promising direction for ensemble clustering is to derive pairwise similarity from BPs, and then transform it into a graph partition problem. However, these graph-based methods may suffer from information loss when computing the similarity between data points, because they only utilize the categorical data provided by multiple BPs, yet neglect rich information from raw features. This problem can badly undermine the underlying cluster structure in the original feature space, and thus degrade the clustering performance. In light of this, we propose a novel Simultaneous Clustering and Ensemble (SCE) framework to alleviate this detrimental effect, which employs the similarity matrix from raw features to enhance the co-association matrix summarized by multiple BPs. Two neat closed-form solutions given by eigenvalue decomposition are provided for SCE. Experiments conducted on 16 real-world datasets demonstrate the effectiveness of the proposed SCE over the traditional clustering and state-of-the-art ensemble clustering methods. Moreover, several impact factors that may affect our method are also explored extensively.
Wisdom of Crowds cluster ensemble
Alizadeh, Hosein, Yousefnezhad, Muhammad, Bidgoli, Behrouz Minaei
The Wisdom of Crowds is a phenomenon described in social science that suggests four criteria applicable to groups of people. It is claimed that, if these criteria are satisfied, then the aggregate decisions made by a group will often be better than those of its individual members. Inspired by this concept, we present a novel feedback framework for the cluster ensemble problem, which we call Wisdom of Crowds Cluster Ensemble (WOCCE). Although many conventional cluster ensemble methods focusing on diversity have recently been proposed, WOCCE analyzes the conditions necessary for a crowd to exhibit this collective wisdom. These include decentralization criteria for generating primary results, independence criteria for the base algorithms, and diversity criteria for the ensemble members. We suggest appropriate procedures for evaluating these measures, and propose a new measure to assess the diversity. We evaluate the performance of WOCCE against some other traditional base algorithms as well as state-of-the-art ensemble methods. The results demonstrate the efficiency of WOCCE's aggregate decision-making compared to other algorithms.