Clustering
A Novel Incremental Clustering Technique with Concept Drift Detection
Woodbright, Mitchell D., Rahman, Md Anisur, Islam, Md Zahidul
Data are being collected from various aspects of life. These data can often arrive in chunks/batches. Traditional static clustering algorithms are not suitable for dynamic datasets, i.e., when data arrive in streams of chunks/batches. If we apply a conventional clustering technique over the combined dataset, then every time a new batch of data comes, the process can be slow and wasteful. Moreover, it can be challenging to store the combined dataset in memory due to its ever-increasing size. As a result, various incremental clustering techniques have been proposed. These techniques need to efficiently update the current clustering result whenever a new batch arrives, to adapt the current clustering result/solution with the latest data. These techniques also need the ability to detect concept drifts when the clustering pattern of a new batch is significantly different from older batches. Sometimes, clustering patterns may drift temporarily in a single batch while the next batches do not exhibit the drift. Therefore, incremental clustering techniques need the ability to detect a temporary drift and sustained drift. In this paper, we propose an efficient incremental clustering algorithm called UIClust. It is designed to cluster streams of data chunks, even when there are temporary or sustained concept drifts. We evaluate the performance of UIClust by comparing it with a recently published, high-quality incremental clustering algorithm. We use real and synthetic datasets. We compare the results by using well-known clustering evaluation criteria: entropy, sum of squared errors (SSE), and execution time. Our results show that UIClust outperforms the existing technique in all our experiments.
Elastic Coupled Co-clustering for Single-Cell Genomic Data
Zeng, Pengcheng, Lin, Zhixiang
The recent advances in single-cell technologies have enabled us to profile genomic features at unprecedented resolution and data sets from multiple domains are available, including data sets that profile different types of genomic features and data sets that profile the same type of genomic features across different species. These data sets typically have different powers in identifying the unknown cell types through clustering, and data integration can potentially lead to a better performance of clustering algorithms. In this work, we formulate the problem in an unsupervised transfer learning framework, which utilizes knowledge learned from auxiliary data set to improve the clustering performance of target data set. The degree of shared information among the target and auxiliary data sets can vary, and their distributions can also be different. To address these challenges, we propose an elastic coupled co-clustering based transfer learning algorithm, by elastically propagating clustering knowledge obtained from the auxiliary data set to the target data set. Implementation on single-cell genomic data sets shows that our algorithm greatly improves clustering performance over the traditional learning algorithms. The source code and data sets are available at https://github.com/cuhklinlab/elasticC3.
ABBA: Adaptive Brownian bridge-based symbolic aggregation of time series
Elsworth, Steven, Gรผttel, Stefan
Symbolic time series representations allow for the use of algorithms from text processing and bioinformatics, which often take advantage of the discrete nature of the data. Our focus in this work is to develop a symbolic representation which is dimension reducing whilst preserving the essential shape of the time series. Our definition of shape is different from the one commonly implied in the context of time series: we focus on representing the peaks and troughs of the time series in their correct order of appearance, but we are happy to slightly stretch the time series in both the time and value directions. In other words, our focus is not necessarily on approximating the time series values at the correct time points, but on representing the local up-and-down behavior of the time series and identifying repeated motifs. This is obviously not appropriate in all applications, but we believe it is close to how humans summarize the overall behavior of a time series, and in that our representation might be useful for trend prediction, anomaly detection, and motif discovery. To illustrate, let us consider the time series shown in Figure 1. This series is sampled at equidistant time points with values t 0,t 1,...,t N R, where N 230. There are various ways of describing this time series, for example: (a) It is exactly representable as a high-dimensional vector T [t 0,t 1,...,t N ] R N 1 .
Kernel Truncated Regression Representation for Robust Subspace Clustering
Zhen, Liangli, Peng, Dezhong, Wang, Wei, Yao, Xin
Subspace clustering aims to group data points into multiple clusters of which each corresponds to one subspace. Most existing subspace clustering approaches assume that input data lie on linear subspaces. In practice, however, this assumption usually does not hold. To achieve nonlinear subspace clustering, we propose a novel method, called kernel truncated regression representation. Our method consists of the following four steps: 1) projecting the input data into a hidden space, where each data point can be linearly represented by other data points; 2) calculating the linear representation coefficients of the data representations in the hidden space; 3) truncating the trivial coefficients to achieve robustness and block-diagonality; and 4) executing the graph cutting operation on the coefficient matrix by solving a graph Laplacian problem. Our method has the advantages of a closed-form solution and the capacity of clustering data points that lie on nonlinear subspaces. The first advantage makes our method efficient in handling large-scale datasets, and the second one enables the proposed method to conquer the nonlinear subspace clustering challenge. Extensive experiments on six benchmarks demonstrate the effectiveness and the efficiency of the proposed method in comparison with current state-of-the-art approaches.
The impossibility of low rank representations for triangle-rich complex networks
Seshadhri, C., Sharma, Aneesh, Stolman, Andrew, Goel, Ashish
The study of complex networks is a significant development in modern science, and has enriched the social sciences, biology, physics, and computer science. Models and algorithms for such networks are pervasive in our society, and impact human behavior via social networks, search engines, and recommender systems to name a few. A widely used algorithmic technique for modeling such complex networks is to construct a low-dimensional Euclidean embedding of the vertices of the network, where proximity of vertices is interpreted as the likelihood of an edge. Contrary to the common view, we argue that such graph embeddings do not}capture salient properties of complex networks. The two properties we focus on are low degree and large clustering coefficients, which have been widely established to be empirically true for real-world networks. We mathematically prove that any embedding (that uses dot products to measure similarity) that can successfully create these two properties must have rank nearly linear in the number of vertices. Among other implications, this establishes that popular embedding techniques such as Singular Value Decomposition and node2vec fail to capture significant structural aspects of real-world complex networks. Furthermore, we empirically study a number of different embedding techniques based on dot product, and show that they all fail to capture the triangle structure.
Resolving single-cell heterogeneity from hundreds of thousands of cells through sequential hybrid clustering and NMF
We describe a new iteration of ICGS that outperforms state-of-the-art scRNA-Seq detection workflows when applied to well-established benchmarks. This approach combines multiple complementary subtype detection methods (HOPACH, sparse-NMF, cluster "fitness", SVM) to resolve rare and common cell-states, while minimizing differences due to donor or batch effects. Using data from multiple cell atlases, we show that the PageRank algorithm effectively down-samples ultra-large scRNA-Seq datasets, without losing extremely rare or transcriptionally similar yet distinct cell-types and while recovering novel transcriptionally distinct cell populations. We believe this new approach holds tremendous promise in reproducibly resolving hidden cell populations in complex datasets.
Unsupervised Fuzzy eIX: Evolving Internal-eXternal Fuzzy Clustering
Aguiar, Charles, Leite, Daniel
Time-varying classifiers, namely, evolving classifiers, play an important role in a scenario in which information is available as a never-ending online data stream. We present a new unsupervised learning method for numerical data called evolving Internal-eXternal Fuzzy clustering method (Fuzzy eIX). We develop the notion of double-boundary fuzzy granules and elaborate on its implications. Type 1 and type 2 fuzzy inference systems can be obtained from the projection of Fuzzy eIX granules. We perform the principle of the balanced information granularity within Fuzzy eIX classifiers to achieve a higher level of model understandability. Internal and external granules are updated from a numerical data stream at the same time that the global granular structure of the classifier is autonomously evolved. A synthetic nonstationary problem called Rotation of Twin Gaussians shows the behavior of the classifier. The Fuzzy eIX classifier could keep up with its accuracy in a scenario in which offline-trained classifiers would clearly have their accuracy drastically dropped.
Incorporating User's Preference into Attributed Graph Clustering
Ye, Wei, Mautz, Dominik, Boehm, Christian, Singh, Ambuj, Plant, Claudia
Graph clustering has been studied extensively on both plain graphs and attributed graphs. However, all these methods need to partition the whole graph to find cluster structures. Sometimes, based on domain knowledge, people may have information about a specific target region in the graph and only want to find a single cluster concentrated on this local region. Such a task is called local clustering. In contrast to global clustering, local clustering aims to find only one cluster that is concentrating on the given seed vertex (and also on the designated attributes for attributed graphs). Currently, very few methods can deal with this kind of task. To this end, we propose two quality measures for a local cluster: Graph Unimodality (GU) and Attribute Unimodality (AU). The former measures the homogeneity of the graph structure while the latter measures the homogeneity of the subspace that is composed of the designated attributes. We call their linear combination as Compactness. Further, we propose LOCLU to optimize the Compactness score. The local cluster detected by LOCLU concentrates on the region of interest, provides efficient information flow in the graph and exhibits a unimodal data distribution in the subspace of the designated attributes.
Tree Index: A New Cluster Evaluation Technique
Beg, A. H., Islam, Md Zahidul, Estivill-Castro, Vladimir
We introduce a cluster evaluation technique called Tree Index. Our Tree Index algorithm aims at describing the structural information of the clustering rather than the quantitative format of cluster-quality indexes (where the representation power of clustering is some cumulative error similar to vector quantization). Our Tree Index is finding margins amongst clusters for easy learning without the complications of Minimum Description Length. Our Tree Index produces a decision tree from the clustered data set, using the cluster identifiers as labels. It combines the entropy of each leaf with their depth. Intuitively, a shorter tree with pure leaves generalizes the data well (the clusters are easy to learn because they are well separated). So, the labels are meaningful clusters. If the clustering algorithm does not separate well, trees learned from their results will be large and too detailed. We show that, on the clustering results (obtained by various techniques) on a brain dataset, Tree Index discriminates between reasonable and non-sensible clusters. We confirm the effectiveness of Tree Index through graphical visualizations. Tree Index evaluates the sensible solutions higher than the non-sensible solutions while existing cluster-quality indexes fail to do so.
How machine learning can power your business
An unprecedented volume of data is currently being generated across the globe with no less than an estimated 2.5 quintillion (1018) bytes of data each day at our current pace. The variety of formats in which this data is being produced, and its structural complexity, are also on the rise. Collectively, these factors are driving demand among institutions for advanced analytics to generate actionable insights. At the most basic level, machine learning encompasses the use of computational algorithms more advanced than the analytics methods (data mining approaches, for example) traditionally employed to deliver insights into large datasets. Machine learning techniques are firmly rooted in the science of statistics and have valuable applications not least in financial services.