 Clustering


Hypergraph clustering with categorical edge labels

arXiv.org Machine Learning

Graphs and networks are a standard model for describing data or systems based on pairwise interactions. Oftentimes, the underlying relationships involve more than two entities at a time, and hypergraphs are a more faithful model. However, we have fewer rigorous methods that can provide insight from such representations. Here, we develop a computational framework for the problem of clustering hypergraphs with categorical edge labels (that is, different interaction types), where clusters correspond to groups of nodes that frequently participate in the same type of interaction. Our methodology is based on a combinatorial objective function that is related to correlation clustering but enables the design of much more efficient algorithms. When there are only two label types, our objective can be optimized in polynomial time, using an algorithm based on minimum cuts. Minimizing our objective becomes NP-hard with more than two label types, but we develop fast approximation algorithms based on linear programming relaxations that have theoretical cluster quality guarantees. We demonstrate the efficacy of our algorithms and the scope of the model through problems in edge-label community detection, clustering with temporal data, and exploratory data analysis.
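The abstract does not spell out the objective, so the following is only a minimal sketch under one assumed reading: each node is assigned to exactly one label, and a hyperedge counts as a mistake when not all of its nodes carry the edge's own label. The function names `edge_mistakes` and `brute_force_clustering` and the toy data are hypothetical. The exhaustive search is for illustration on tiny inputs only; it is not the min-cut or LP-based algorithm the abstract mentions.

```python
from itertools import product


def edge_mistakes(edges, assignment):
    """Count hyperedges whose nodes are not all assigned to the edge's label.

    ASSUMPTION: this is one plausible form of the objective, not a formula
    quoted from the paper.
    """
    return sum(
        1
        for nodes, label in edges
        if any(assignment[v] != label for v in nodes)
    )


def brute_force_clustering(nodes, labels, edges):
    """Try every node-to-label assignment; only feasible for tiny inputs."""
    best_assignment, best_cost = None, None
    for choice in product(labels, repeat=len(nodes)):
        assignment = dict(zip(nodes, choice))
        cost = edge_mistakes(edges, assignment)
        if best_cost is None or cost < best_cost:
            best_assignment, best_cost = assignment, cost
    return best_assignment, best_cost


# Toy hypergraph: each entry is (nodes in the hyperedge, categorical label).
nodes = ["a", "b", "c", "d", "e"]
labels = ["red", "blue"]
edges = [
    (("a", "b", "c"), "red"),
    (("a", "b"), "red"),
    (("c", "d", "e"), "blue"),
    (("d", "e"), "blue"),
]

best_assignment, best_cost = brute_force_clustering(nodes, labels, edges)
print(best_assignment, best_cost)
```

Node `c` sits in both a red and a blue hyperedge, so under this reading at least one edge must be counted as a mistake whatever label `c` receives.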


A Unified Framework for Tuning Hyperparameters in Clustering Problems

#artificialintelligence

Selecting hyperparameters for unsupervised learning problems is difficult in general due to the lack of ground truth for validation. However, this issue is prevalent in machine learning, especially in clustering problems, with examples including the Lagrange multipliers of penalty terms in semidefinite programming (SDP) relaxations and the bandwidths used for constructing kernel similarity matrices for Spectral Clustering. Despite this, there are not many provable algorithms for tuning these hyperparameters. In this paper, we provide a unified framework with provable guarantees for the above class of problems. We demonstrate our method on two distinct models.


Sampling random graph homomorphisms and applications to network data analysis

arXiv.org Machine Learning

A graph homomorphism is a map between two graphs that preserves adjacency relations. We consider the problem of sampling a random graph homomorphism from a graph $F$ into a large network $\mathcal{G}$. When $\mathcal{G}$ is the complete graph with $q$ nodes, this becomes the well-known problem of sampling uniform $q$-colorings of $F$. We propose two complementary MCMC algorithms for sampling random graph homomorphisms and establish bounds on their mixing times and concentration of their time averages. Based on our sampling algorithms, we propose a novel framework for network data analysis that circumvents some of the drawbacks in methods based on independent and neighborhood sampling. Various time averages of the MCMC trajectory give us real-, function-, and network-valued computable observables, including well-known ones such as homomorphism density and average clustering coefficient. One of the main observables we propose is called the conditional homomorphism density profile, which reveals hierarchical structure of the network. Furthermore, we show that these network observables are stable with respect to a suitably renormalized cut distance between networks. We also provide various examples and simulations demonstrating our framework through synthetic and real-world networks. For instance, we apply our framework to analyze Word Adjacency Networks of a 45-novel data set and propose an authorship attribution scheme using motif sampling and conditional homomorphism density profiles.
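To make the definition concrete, here is a small sketch that checks whether a node map preserves adjacency and confirms, on a toy case, that homomorphisms into the complete graph on $q$ nodes are exactly the proper $q$-colorings. It uses brute-force enumeration rather than the MCMC samplers of the abstract, and all function names are hypothetical.

```python
from itertools import product


def is_homomorphism(mapping, f_edges, g_edges):
    """True if every edge (u, v) of F is sent to an edge of G (undirected)."""
    g = {frozenset(edge) for edge in g_edges}
    return all(frozenset((mapping[u], mapping[v])) in g for u, v in f_edges)


def complete_graph_edges(q):
    """Edges of the complete graph on nodes 0..q-1 (no self-loops)."""
    return [(i, j) for i in range(q) for j in range(i + 1, q)]


def count_homomorphisms(f_nodes, f_edges, g_nodes, g_edges):
    """Enumerate all maps F -> G and count those that preserve adjacency."""
    return sum(
        is_homomorphism(dict(zip(f_nodes, image)), f_edges, g_edges)
        for image in product(g_nodes, repeat=len(f_nodes))
    )


# F is a path on three nodes; G is the complete graph K_3.
path_nodes = [0, 1, 2]
path_edges = [(0, 1), (1, 2)]
k3_nodes = [0, 1, 2]
k3_edges = complete_graph_edges(3)

num_colorings = count_homomorphisms(path_nodes, path_edges, k3_nodes, k3_edges)
print(num_colorings)
```

A map that sends two adjacent nodes of $F$ to the same node of the complete graph fails the check because the complete graph has no self-loops, which is the proper-coloring constraint.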


Multi-level conformal clustering: A distribution-free technique for clustering and anomaly detection

arXiv.org Machine Learning

In this work we present a clustering technique called \textit{multi-level conformal clustering (MLCC)}. The technique is hierarchical in nature because it can be performed at multiple significance levels, which yields greater insight into the data than performing it at just one level. We describe the theoretical underpinnings of MLCC, compare and contrast it with the hierarchical clustering algorithm, and then apply it to real-world datasets to assess its performance. There are several advantages to using MLCC over more classical clustering techniques: Once a significance level has been set, MLCC is able to automatically select the number of clusters. Furthermore, thanks to the conformal prediction framework, the resulting clustering model has a clear statistical meaning without any assumptions about the distribution of the data. This statistical robustness also allows us to perform clustering and anomaly detection simultaneously. Moreover, due to the flexibility of the conformal prediction framework, our algorithm can be used on top of many other machine learning algorithms.


Landing Probabilities of Random Walks for Seed-Set Expansion in Hypergraphs

arXiv.org Machine Learning

Eli Chien, Pan Li, Olgica Milenkovic (Department of ECE, UIUC). Eli Chien and Pan Li contribute equally to this work. Preprint version.

Abstract: We describe the first known mean-field study of landing probabilities for random walks on hypergraphs. In particular, we examine clique-expansion and tensor methods and evaluate their mean-field characteristics over a class of random hypergraph models for the purpose of seed-set community expansion. We describe parameter regimes in which the two methods outperform each other and propose a hybrid expansion method that uses partial clique-expansion to reduce the projection distortion and low-complexity tensor methods applied directly on the partially expanded hypergraphs.

1 Introduction

Random walks on graphs are Markov random processes in which, given a starting vertex, one moves to a randomly selected neighbor and then repeats the procedure starting from the newly selected vertex [1]. Random walks are used in many graph-based learning algorithms such as PageRank [2] and Label Propagation [3], and they have found a variety of applications in local community detection [4, 5], information retrieval [2] and semi-supervised learning [3]. Random walks are also frequently used to characterize the topological structure of graphs via the hitting time of a vertex from a seed, the commute time between two vertices [6] and the mixing time, which also characterizes global graph connectivity [7]. Recently, a new measure of vertex connectivity and similarity, termed a landing probability (LP), was introduced in [8]. The LP of a vertex is the probability of a random walk ending at the vertex after making a certain number of steps. Different linear combinations of LPs give rise to different forms of PageRanks (PRs), such as the standard PR [2] and the heat-kernel PR [9], both used for various graph clustering tasks.

In particular, Kloumann et al. [8] also initiated the analysis of PRs based on LPs for seed-based community detection. Under the assumption of a generative stochastic block model (SBM) [10] with two blocks, the authors of [8] proved that the empirical average of LPs within the seed community concentrates around a deterministic centroid. Similarly, the empirical averages of LPs outside the seed community also concentrate around another deterministic centroid.
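As an illustration of landing probabilities on an ordinary graph (not the hypergraph setting of the paper), the sketch below propagates a walk from a single seed and combines the per-step LPs with truncated personalized-PageRank weights $(1-\alpha)\alpha^k$. The function names, the toy graph, and the choices $\alpha = 0.85$ and ten steps are assumptions made for the example.

```python
def landing_probabilities(adjacency, seed, num_steps):
    """Return [p_0, ..., p_K], where p_k[v] is the probability that a simple
    random walk started at `seed` is at vertex v after k steps.

    `adjacency` is a list of neighbor lists; every vertex is assumed to have
    at least one neighbor.
    """
    n = len(adjacency)
    current = [0.0] * n
    current[seed] = 1.0
    history = [current]
    for _ in range(num_steps):
        nxt = [0.0] * n
        for u, neighbors in enumerate(adjacency):
            if current[u] == 0.0:
                continue
            share = current[u] / len(neighbors)  # uniform move to a neighbor
            for v in neighbors:
                nxt[v] += share
        current = nxt
        history.append(current)
    return history


def combine(lps, weights):
    """Linear combination of landing-probability vectors, one weight per step."""
    n = len(lps[0])
    return [sum(w * lp[v] for w, lp in zip(weights, lps)) for v in range(n)]


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]

num_steps = 10
alpha = 0.85  # assumed damping value for the example
lps = landing_probabilities(adjacency, seed=0, num_steps=num_steps)
ppr_weights = [(1 - alpha) * alpha**k for k in range(num_steps + 1)]
scores = combine(lps, ppr_weights)
print(scores)
```

Swapping `ppr_weights` for another weight sequence changes the PageRank variant while reusing the same LP vectors, which is the sense in which different linear combinations give different PRs.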


Differentiable Deep Clustering with Cluster Size Constraints

arXiv.org Machine Learning

Clustering is a fundamental unsupervised learning approach. Many clustering algorithms, such as $k$-means, rely on the Euclidean distance as a similarity measure, which is often not the most relevant metric for high-dimensional data such as images. Learning a lower-dimensional embedding that can better reflect the geometry of the dataset is therefore instrumental for performance. We propose a new approach for this task where the embedding is performed by a differentiable model such as a deep neural network. By rewriting the $k$-means clustering algorithm as an optimal transport task, and adding an entropic regularization, we derive a fully differentiable loss function that can be minimized with respect to both the embedding parameters and the cluster parameters via stochastic gradient descent. We show that this new formulation generalizes a recently proposed state-of-the-art method based on soft-$k$-means by adding constraints on the cluster sizes. Empirical evaluations on image classification benchmarks suggest that compared to state-of-the-art methods, our optimal transport-based approach provides better unsupervised accuracy and does not require a pre-training phase.
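The following sketch illustrates the general idea of soft cluster assignment as entropy-regularized optimal transport with prescribed cluster sizes. It uses standard Sinkhorn iterations with fixed centroids and no learned embedding, which is an assumption for illustration and not necessarily the solver or loss used in the paper. The function name `sinkhorn_assignments`, the regularization strength `epsilon`, and the toy points are hypothetical.

```python
import math


def sinkhorn_assignments(points, centroids, cluster_weights,
                         epsilon=0.5, num_iters=200):
    """Soft-assign points to centroids with prescribed cluster masses.

    Returns a transport plan P (n x k) whose rows sum to about 1/n and whose
    columns sum to `cluster_weights` (which must sum to 1).
    """
    n, k = len(points), len(centroids)
    # Squared Euclidean cost between every point and every centroid.
    cost = [[sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
            for x in points]
    # Gibbs kernel from the entropic regularization.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(k)]
              for i in range(n)]
    row_mass = 1.0 / n
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(k))
             for i in range(n)]
        v = [cluster_weights[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2),
          (2.0, 2.0), (2.2, 2.0), (2.0, 2.2)]
centroids = [(0.0, 0.0), (2.0, 2.0)]
cluster_weights = [0.5, 0.5]  # size constraint: half of the mass per cluster

plan = sinkhorn_assignments(points, centroids, cluster_weights)
hard_labels = [max(range(len(centroids)), key=lambda j: row[j]) for row in plan]
print(hard_labels)
```

Because the plan is built only from differentiable operations, the same computation could in principle be written in an automatic-differentiation framework; this plain-Python version only shows the marginal constraints at work.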


Sparse-Dense Subspace Clustering

arXiv.org Machine Learning

Subspace clustering refers to the problem of clustering high-dimensional data into a union of low-dimensional subspaces. Current subspace clustering approaches are usually based on a two-stage framework. In the first stage, an affinity matrix is generated from data. In the second one, spectral clustering is applied on the affinity matrix. However, the affinity matrix produced by two-stage methods cannot fully reveal the similarity between data points from the same subspace (intra-subspace similarity), resulting in inaccurate clustering. Besides, most approaches fail to solve large-scale clustering problems due to poor efficiency. In this paper, we first propose a new scalable sparse method called Iterative Maximum Correlation (IMC) to learn the affinity matrix from data. Then we develop Piecewise Correlation Estimation (PCE) to densify the intra-subspace similarity produced by IMC. Finally, we extend our work into a Sparse-Dense Subspace Clustering (SDSC) framework with a dense stage to optimize the affinity matrix for two-stage methods. We show that IMC is efficient when clustering large-scale data, and PCE ensures better performance for IMC. We show the universality of our SDSC framework as well. Experiments on several data sets demonstrate the effectiveness of our approaches. Moreover, we are the first to apply densification to the affinity matrix before spectral clustering, and SDSC constitutes the first attempt to build a universal three-stage subspace clustering framework.


Benchmark Dataset for Timetable Optimization of Bus Routes in the City of New Delhi

arXiv.org Machine Learning

Public transport is one of the major forms of transportation in the world. This makes it vital to ensure that public transport is efficient. This research presents a novel real-time GPS bus transit dataset covering over 500 bus routes operating in New Delhi. The data can be used for modeling various timetable optimization tasks as well as in other domains such as traffic management, travel time estimation, etc. The paper also presents an approach to reduce the waiting time of Delhi buses by analyzing the traffic behavior and proposing a timetable. This algorithm serves as a benchmark for the dataset. The algorithm uses a constrained clustering algorithm for classification of trips. It further analyses the data statistically to provide a timetable which is efficient in learning the inter- and intra-month variations.


Identification of Interaction Clusters Using a Semi-supervised Hierarchical Clustering Method

arXiv.org Machine Learning

Motivation: Identifying interaction clusters in large gene regulatory networks (GRNs) is critical for their further investigation, but the task is very challenging because of noise in experimental data, the large scale of GRNs, and inconsistency between gene expression profiles and function modules. Semi-supervising this process with prior information is promising, but a shortage of prior information sometimes makes it very challenging. It is also difficult, and sometimes impossible, to find a gold standard for evaluating clustering results.

Results: With the assistance of an online enrichment tool, this research proposes a semi-supervised hierarchical clustering method via deconvolved correlation matrix (SHC-DC) to discover interaction clusters of large-scale GRNs. Three benchmark networks, including an \emph{Ecoli} network and two \emph{Yeast} networks, are employed to test the semi-supervision scheme of the proposed method. Then, SHC-DC is used to cluster genes in a sleep study. The results demonstrate that it can find interaction modules that are generally enriched in various signal pathways. Besides the significant influence on blood level of interleukins, the impact of sleep on important pathways mediated by them is also validated by the discovered interaction modules.


Top three mistakes with K-Means Clustering during data analysis

#artificialintelligence

In this post, we will take a look at a few cases where the K-Means clustering (KMC) algorithm does not perform well or may produce unintuitive results. All of these conditions can lead to problems with K-Means, so let's have a look. To make it easier, let's define a helper function compare, which will create and solve the clustering problem for us and then compare the results. Despite having distinct clusters in the data, we underestimated their number. As a consequence, some disjoint groups of data are forced to fit into one larger cluster.
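The post's own compare helper is not shown in this excerpt, so the following is only a guess at what such a helper might look like. It generates well-separated blobs, runs a small pure-Python Lloyd-style K-Means, and reports which true groups ended up in each fitted cluster. The signature, the defaults, and the random-sample initialization are all assumptions.

```python
import random


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, num_iters=50, seed=0):
    """Plain Lloyd iterations with centroids initialized from random points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(num_iters):
        labels = [min(range(k), key=lambda j: sqdist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    # Final assignment against the final centroids.
    labels = [min(range(k), key=lambda j: sqdist(p, centroids[j]))
              for p in points]
    return labels, centroids


def compare(true_centers, k_fit, points_per_blob=30, spread=0.3, seed=0):
    """Create blobs around `true_centers`, fit K-Means with `k_fit` clusters,
    and return {fitted cluster id: set of true group ids found in it}."""
    rng = random.Random(seed)
    points, truth = [], []
    for group, (cx, cy) in enumerate(true_centers):
        for _ in range(points_per_blob):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            truth.append(group)
    labels, _ = kmeans(points, k_fit, seed=seed)
    groups_in_cluster = {}
    for label, group in zip(labels, truth):
        groups_in_cluster.setdefault(label, set()).add(group)
    return groups_in_cluster


# Three distinct blobs, but we (wrongly) ask for only two clusters.
result = compare([(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)], k_fit=2)
print(result)
```

With three true groups and only two fitted clusters, at least one fitted cluster must contain points from more than one group, which is the underestimation effect described above.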