Clustering
Mixed-Order Spectral Clustering for Networks
Ge, Yan, Lu, Haiping, Peng, Pan
Clustering is fundamental for gaining insights from complex networks, and spectral clustering (SC) is a popular approach. Conventional SC focuses on second-order structures (e.g., edges connecting two nodes) without direct consideration of higher-order structures (e.g., triangles and cliques). This has motivated SC extensions that directly consider higher-order structures. However, both approaches are limited to considering a single order. This paper proposes a new Mixed-Order Spectral Clustering (MOSC) approach to model both second-order and third-order structures simultaneously, with two MOSC methods developed based on Graph Laplacian (GL) and Random Walks (RW). MOSC-GL combines edge and triangle adjacency matrices, with theoretical performance guarantee. MOSC-RW combines first-order and second-order random walks for a probabilistic interpretation. We automatically determine the mixing parameter based on cut criteria or triangle density, and construct new structure-aware error metrics for performance evaluation. Experiments on real-world networks show 1) the superior performance of two MOSC methods over existing SC methods, 2) the effectiveness of the mixing parameter determination strategy, and 3) insights offered by the structure-aware error metrics.
Parallel Clustering of Single Cell Transcriptomic Data with Split-Merge Sampling on Dirichlet Process Mixtures
Duan, Tiehang, Pinto, Josรฉ P., Xie, Xiaohui
Motivation: With the development of droplet based systems, massive single cell transcriptome data has become available, which enables analysis of cellular and molecular processes at single cell resolution and is instrumental to understanding many biological processes. While state-of-the-art clustering methods have been applied to the data, they face challenges in the following aspects: (1) the clustering quality still needs to be improved; (2) most models need prior knowledge on number of clusters, which is not always available; (3) there is a demand for faster computational speed. Results: We propose to tackle these challenges with Parallel Split Merge Sampling on Dirichlet Process Mixture Model (the Para-DPMM model). Unlike classic DPMM methods that perform sampling on each single data point, the split merge mechanism samples on the cluster level, which significantly improves convergence and optimality of the result. The model is highly parallelized and can utilize the computing power of high performance computing (HPC) clusters, enabling massive clustering on huge datasets. Experiment results show the model outperforms current widely used models in both clustering quality and computational speed. Availability: Source code is publicly available on https://github.com/tiehangd/Para_DPMM/tree/master/Para_DPMM_package
Deep Representation Learning for Clustering of Health Tweets
Twitter has been a prominent social media platform for mining population-level health data and accurate clustering of health-related tweets into topics is important for extracting relevant health insights. In this work, we propose deep convolutional autoencoders for learning compact representations of health-related tweets, further to be employed in clustering. We compare our method to several conventional tweet representation methods including bag-of-words, term frequency-inverse document frequency, Latent Dirichlet Allocation and Non-negative Matrix Factorization with 3 different clustering algorithms. Our results show that the clustering performance using proposed representation learning scheme significantly outperforms that of conventional methods for all experiments of different number of clusters. In addition, we propose a constraint on the learned representations during the neural network training in order to further enhance the clustering performance. All in all, this study introduces utilization of deep neural network-based architectures, i.e., deep convolutional autoencoders, for learning informative representations of health-related tweets.
Nonparametric Feature Extraction from Dendrograms
Chehreghani, Morteza Haghir, Chehreghani, Mostafa Haghir
We study nonparametric feature extraction from hierarchies. The commonly used Minimax distance measures correspond to building a dendrogram with single linkage criterion, with the definition of specific forms of a level function and a distance function over that. Therefore, we develop a generalized framework wherein different distance measures can be inferred from different types of dendrograms, level functions and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances, in order to enable many numerical machine learning algorithms to employ such distances. Then, we study the aggregation of different dendrogram-based distances respectively in solution space and in representation space in the spirit of deep learning models. Finally, we demonstrate the effectiveness of our approach via numerical studies.
Reliable Agglomerative Clustering
We analyze the general behavior of agglomerative clustering methods, and argue that their strategy yields establishment of a new reliable linkage at each step. However, in order to provide adaptive, density-consistent and flexible solutions, we propose to extract all the reliable linkages at each step, instead of the smallest one. This leads to a new agglomerative clustering strategy, called reliable agglomerative clustering, which similar to the standard agglomerative variant can be applied with all common criteria. Moreover, we prove that this strategy with the \emph{single} linkage criterion yields a minimum spanning tree algorithm. We perform experiments on several real-world datasets to demonstrate the superior performance of this strategy, compared to the standard alternative.
Kappa Learning: A New Method for Measuring Similarity Between Educational Items Using Performance Data
Nazaretsky, Tanya, Hershkovitz, Sara, Alexandron, Giora
Sequencing items in adaptive learning systems typically relies on a large pool of interactive assessment items (questions) that are analyzed into a hierarchy of skills or Knowledge Components (KCs). Educational data mining techniques can be used to analyze students performance data in order to optimize the mapping of items to KCs. Standard methods that map items into KCs using item-similarity measures make the implicit assumption that students performance on items that depend on the same skill should be similar. This assumption holds if the latent trait (mastery of the underlying skill) is relatively fixed during students activity, as in the context of testing, which is the primary context in which these measures were developed and applied. However, in adaptive learning systems that aim for learning, and address subject matters such as K6 Math that consist of multiple sub-skills, this assumption does not hold. In this paper we propose a new item-similarity measure, termed Kappa Learning (KL), which aims to address this gap. KL identifies similarity between items under the assumption of learning, namely, that learners mastery of the underlying skills changes as they progress through the items. We evaluate Kappa Learning on data from a computerized tutor that teaches Fractions for 4th grade, with experts tagging as ground truth, and on simulated data. Our results show that clustering that is based on Kappa Learning outperforms clustering that is based on commonly used similarity measures (Cohen Kappa, Yule, and Pearson).
A Fuzzy Community-Based Recommender System Using PageRank
Goliforoushani, Maliheh, Rad, Radin Hamidi, Haeri, Maryam Amir
Recommendation systems are widely used by different user service providers specially those who have interactions with the large community of users. This paper introduces a recommender system based on community detection. The recommendation is provided using the local and global similarities between users. The local information is obtained from communities, and the global ones are based on the ratings. Here, a new fuzzy community detection using the personalized PageRank metaphor is introduced. The fuzzy membership values of the users to the communities are utilized to define a similarity measure. The method is evaluated by using two well-known datasets: MovieLens and FilmTrust. The results show that our method outperforms recent recommender systems.
Robust Graph Learning from Noisy Data
Kang, Zhao, Pan, Haiqi, Hoi, Steven C. H., Xu, Zenglin
Learning graphs from data automatically has shown encouraging performance on clustering and semisupervised learning tasks. However, real data are often corrupted, which may cause the learned graph to be inexact or unreliable. In this paper, we propose a novel robust graph learning scheme to learn reliable graphs from real-world noisy data by adaptively removing noise and errors in the raw data. We show that our proposed model can also be viewed as a robust version of manifold regularized robust PCA, where the quality of the graph plays a critical role. The proposed model is able to boost the performance of data clustering, semisupervised classification, and data recovery significantly, primarily due to two key factors: 1) enhanced low-rank recovery by exploiting the graph smoothness assumption, 2) improved graph construction by exploiting clean data recovered by robust PCA. Thus, it boosts the clustering, semi-supervised classification, and data recovery performance overall. Extensive experiments on image/document clustering, object recognition, image shadow removal, and video background subtraction reveal that our model outperforms the previous state-of-the-art methods.
Higher-Order Spectral Clustering under Superimposed Stochastic Block Model
Paul, Subhadeep, Milenkovic, Olgica, Chen, Yuguo
Higher-order motif structures and multi-vertex interactions are becoming increasingly important in studies that aim to improve our understanding of functionalities and evolution patterns of networks. To elucidate the role of higher-order structures in community detection problems over complex networks, we introduce the notion of a Superimposed Stochastic Block Model (SupSBM). The model is based on a random graph framework in which certain higher-order structures or subgraphs are generated through an independent hyperedge generation process, and are then replaced with graphs that are superimposed with directed or undirected edges generated by an inhomogeneous random graph model. Consequently, the model introduces controlled dependencies between edges which allow for capturing more realistic network phenomena, namely strong local clustering in a sparse network, short average path length, and community structure. We proceed to rigorously analyze the performance of a number of recently proposed higher-order spectral clustering methods on the SupSBM. In particular, we prove non-asymptotic upper bounds on the misclustering error of spectral community detection for a SupSBM setting in which triangles or 3-uniform hyperedges are superimposed with undirected edges. As part of our analysis, we also derive new bounds on the misclustering error of higher-order spectral clustering methods for the standard SBM and the 3-uniform hypergraph SBM. Furthermore, for a non-uniform hypergraph SBM model in which one directly observes both edges and 3-uniform hyperedges, we obtain a criterion that describes when to perform spectral clustering based on edges and when on hyperedges, based on a function of hyperedge density and observation quality.
Deep Clustering Based on a Mixture of Autoencoders
Chazan, Shlomo E., Gannot, Sharon, Goldberger, Jacob
In this paper we propose a Deep Autoencoder MIxture Clustering (DAMIC) algorithm based on a mixture of deep autoencoders where each cluster is represented by an autoencoder. A clustering network transforms the data into another space and then selects one of the clusters. Next, the autoencoder associated with this cluster is used to reconstruct the data-point. The clustering algorithm jointly learns the nonlinear data representation and the set of autoencoders. The optimal clustering is found by minimizing the reconstruction loss of the mixture of autoencoder network. Unlike other deep clustering algorithms, no regularization term is needed to avoid data collapsing to a single point. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.