new cluster
Export Reviews, Discussions, Author Feedback and Meta-Reviews
Review Summary Score: 6, Marginally above the acceptance threshold The proposed method for streaming, distributed inference of DP mixture models presents a nice solution to the cluster identification problem, backed by experiments that are convincing though not rock solid. I'm hesitant to recommend unconditional acceptance, because basic information about how new clusters are created at each minibatch is totally absent, hurting reproducibility. Summary of Paper This paper develops a new algorithm for streaming, distributed variational inference for the DP mixture model, with some supplementary material suggesting how to use these insights for many other BNP models. Using a mean-field approximation, the authors consider how to allow multiple worker nodes to process data batches in parallel and then aggregate these results asynchronously. In particular, the authors offer a new solution to the "component identification" problem: how to find correspondence between new clusters created independently by two separate worker nodes.
Graph Community Augmentation with GMM-based Modeling in Latent Space
Fukushima, Shintaro, Yamanishi, Kenji
This study addresses the issue of graph generation with generative models. In particular, we are concerned with the graph community augmentation problem, which refers to the problem of generating unseen or unfamiliar graphs with a new community out of the probability distribution estimated with a given graph dataset. Graph community augmentation means that the generated graphs have a new community. There is a chance of discovering an unseen but important structure of graphs with a new community, for example, in a social network such as a purchaser network. Graph community augmentation may also be helpful for generalization of data mining models in cases where it is difficult to collect enough real graph data. In fact, there are many ways to generate a new community in an existing graph. It is desirable to discover a new graph with a new community beyond the given graph while keeping the structure of the original graphs to some extent, so that the generated graphs are realistic. To this end, we propose an algorithm called graph community augmentation (GCA). The key ideas of GCA are (i) to fit a Gaussian mixture model (GMM) to data points in the latent space into which the nodes in the original graph are embedded, and (ii) to add data points in the new cluster in the latent space for generating a new community based on the minimum description length (MDL) principle. We empirically demonstrate the effectiveness of GCA for generating graphs with a new community structure on synthetic and real datasets.
Evolving Text Data Stream Mining
A text stream is an ordered sequence of text documents generated over time. A massive amount of such text data is generated by online social platforms every day. Designing an algorithm for such text streams to extract useful information is a challenging task due to unique properties of the stream such as infinite length, data sparsity, and evolution. Consequently, learning useful information from such streaming data under the constraint of limited time and memory has gained increasing attention. Although many text stream mining algorithms have been proposed during the past decade, some potential issues still exist. First, high-dimensional text data heavily degrades the learning performance until the model either works on a subspace or reduces the global feature space. The second issue is to extract semantic text representations of documents and capture evolving topics over time. Moreover, the problem of label scarcity exists, whereas existing approaches assume the full availability of labeled data. To deal with these issues, in this thesis, new learning models are proposed for clustering and multi-label learning on text streams.
bca82e41ee7b0833588399b1fcd177c7-Reviews.html
The authors propose a parallel algorithm for the DPMM that parallelizes a RJMCMC sampler that jumps between finite models. While the parallelization and the RJMCMC sampler are proposed together, I will separate them for the purpose of this review, in order to ask questions about each part separately. First, the RJMCMC algorithm (by which I mean, the algorithm we would have on a single cluster). Here, we use a reversible-jump MCMC algorithm to jump between finite-dimensional Dirichlet distributions. As an aside, since \bar{\pi}_{K+1} is not used in the mixture model (the mixture model is defined on the renormalized occupied K components), it would seem to make more sense to define a K-dimensional, rather than a K-1 - dimensional, Dirichlet distribution; this is valid under marginalization properties of the Dirichlet distribution, since equation 10 samples from a distribution proportional to \pi_1 ... \pi_K. To jump between model dimensionalities, the authors propose a split/merge RJMCMC step that is reminiscent of that of Green and Richardson.
Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture Miao Liu MIT
This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a low-variance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets.
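To give a feel for what a low-variance limit of a Dirichlet process mixture sampler looks like, here is a minimal sketch of the simpler static case, often called DP-means. It is not the dynamic algorithm from the paper above: it ignores cluster motion, birth, and death across batches. The function name `dp_means` and the penalty parameter `lam` are illustrative assumptions; a point opens a new cluster when its squared distance to every existing center exceeds `lam`.

```python
def dp_means(points, lam, max_iter=100):
    """Static DP-means sketch (an assumption, not the paper's dynamic method).

    points: list of equal-length numeric tuples.
    lam: squared-distance threshold above which a point starts a new cluster.
    Returns (labels, centers).
    """
    centers = [tuple(points[0])]
    labels = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        # Assignment step: nearest center, or a new cluster if too far away.
        for i, p in enumerate(points):
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            k = min(range(len(centers)), key=d2.__getitem__)
            if d2[k] > lam:
                centers.append(tuple(p))
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Update step: each center becomes the mean of its members;
        # clusters that lost all members are dropped and labels re-indexed.
        new_centers, remap = [], {}
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                remap[k] = len(new_centers)
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[j] for m in members) / len(members) for j in range(dim))
                )
        labels = [remap[l] for l in labels]
        centers = new_centers
        if not changed:
            break
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = dp_means(points, lam=9.0)
```

As in k-means, each pass alternates assignment and mean updates; the only difference is that the number of clusters is allowed to grow, controlled by `lam`.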
Kernel KMeans clustering splits for end-to-end unsupervised decision trees
Ohl, Louis, Mattei, Pierre-Alexandre, Leclercq, Mickaël, Droit, Arnaud, Precioso, Frédéric
Trees are convenient models for obtaining explainable predictions on relatively small datasets. Although there are many proposals for the end-to-end construction of such trees in supervised learning, learning a tree end-to-end for clustering without labels remains an open challenge. As most works focus on interpreting with trees the result of another clustering algorithm, we present here a novel end-to-end trained unsupervised binary tree for clustering: Kauri. This method performs a greedy maximisation of the kernel KMeans objective without requiring the definition of centroids. We compare this model on multiple datasets with recent unsupervised trees and show that Kauri performs identically when using a linear kernel. For other kernels, Kauri often outperforms the concatenation of kernel KMeans and a CART decision tree.
Tensor Dirichlet Process Multinomial Mixture Model for Passenger Trajectory Clustering
Li, Ziyue, Yan, Hao, Zhang, Chen, Wang, Andi, Ketter, Wolfgang, Sun, Lijun, Tsung, Fugee
Passenger clustering based on travel records is essential for transportation operators. However, existing methods cannot easily cluster the passengers due to the hierarchical structure of the passenger trip information, namely: each passenger has multiple trips, and each trip contains multi-dimensional multi-mode information. Furthermore, existing approaches rely on an accurate specification of the number of clusters to start, which is difficult when millions of commuters are using the transport systems on a daily basis. In this paper, we propose a novel Tensor Dirichlet Process Multinomial Mixture model (Tensor-DPMM), which is designed to preserve the multi-mode and hierarchical structure of the multi-dimensional trip information via tensors, and to cluster the passengers in a unified one-step manner. The model also has the ability to determine the number of clusters automatically by using the Dirichlet Process to decide the probabilities for a passenger to be either assigned to an existing cluster or to create a new cluster. This allows our model to grow the clusters as needed in a dynamic manner. Finally, existing methods do not consider spatial semantic graphs such as geographical proximity and functional similarity between the locations, which may cause inaccurate clustering. To this end, we further propose a variant of our model, namely the Tensor-DPMM with Graph. For the algorithm, we propose a tensor Collapsed Gibbs Sampling method, with an innovative step of "disband and relocate", which disbands clusters with too few members and relocates those members to the remaining clusters. This avoids an uncontrollably growing number of clusters. A case study based on Hong Kong metro passenger data is conducted to demonstrate the automatic process of learning the number of clusters, and the learned clusters are better in within-cluster compactness and cross-cluster separateness.
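The choice between joining an existing cluster and creating a new one can be illustrated with the standard Chinese restaurant process prior. This is only a sketch of the prior term: a collapsed Gibbs sampler such as the one described above would presumably also multiply each entry by a likelihood of the passenger's trips, which is omitted here. The function name and the concentration parameter `alpha` are assumptions for illustration.

```python
def crp_assignment_probs(cluster_sizes, alpha):
    """Chinese restaurant process prior over cluster assignments.

    cluster_sizes: current member counts of the existing clusters.
    alpha: concentration parameter (larger values favour new clusters).
    Returns one probability per existing cluster, followed by the
    probability of opening a new cluster as the last entry.
    """
    total = sum(cluster_sizes) + alpha
    probs = [n / total for n in cluster_sizes]
    probs.append(alpha / total)
    return probs


# Two existing clusters with 3 and 1 members, alpha = 1.
probs = crp_assignment_probs([3, 1], alpha=1.0)
```

Larger clusters attract new members more strongly, while `alpha` keeps a nonzero chance of a new cluster, which is what lets the number of clusters grow with the data.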
Hierarchical Clustering: A Practical Introduction of Agglomerative and Divisive Methods
In this article, we are going to talk in detail about hierarchical clustering: why we need hierarchical clustering, how it works, what types there are, and which datasets it applies to. Before moving on to hierarchical clustering, we should ask why we are talking about it at all when we already have K Means clustering. If you have studied K Means, then you know that this algorithm creates clusters based on the distance to a centroid. It works well on datasets with well-defined boundaries and few outliers. In the above picture, K Means works well, but when we move to more complex datasets, problems arise and K Means does not work properly. As you can see in the picture below, K Means fails to form the clusters.
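The distance-to-centroid idea behind K Means can be sketched in a few lines. This is a single iteration written in plain Python for illustration, not a full implementation; the function names are hypothetical.

```python
def assign_to_nearest_centroid(points, centroids):
    """Label each point with the index of its closest centroid."""
    labels = []
    for p in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(d2.index(min(d2)))
    return labels


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of its assigned points.

    A centroid with no assigned points is left where it is.
    """
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, l in zip(points, labels) if l == k]
        if not members:
            new_centroids.append(tuple(old))
            continue
        dim = len(members[0])
        new_centroids.append(
            tuple(sum(m[j] for m in members) / len(members) for j in range(dim))
        )
    return new_centroids


points = [(0, 0), (1, 0), (9, 9)]
centroids = [(0, 0), (10, 10)]
labels = assign_to_nearest_centroid(points, centroids)
centroids = update_centroids(points, labels, centroids)
```

Because every point is pulled to the nearest centroid, the resulting clusters are roughly round blobs, which is why K Means struggles on datasets with elongated or nested shapes.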
5 Clustering Algorithms Data Scientists Need To Know - The Key Is Always To Understand The Basic Approach Of Any Algorithm You Want To Use – Fly Spaceships With Your Mind
As a data scientist, you have several basic tools at your disposal, which you can also apply in combination to a data set. More and more complex dependencies are formed. This makes it all the more difficult to recognize these similar properties and to assign the data to so-called clusters in a way that can be evaluated. You have certainly heard of these algorithms and maybe used one or the other, but do you really know what clustering algorithms are? So let's first clarify what these algorithms are in the first place.
Deep Bayesian Unsupervised Lifelong Learning
Zhao, Tingting, Wang, Zifeng, Masoomi, Aria, Dy, Jennifer
Lifelong Learning (LL) refers to the ability to continually learn and solve new problems with incrementally available information over time while retaining previous knowledge. Much attention has been given lately to Supervised Lifelong Learning (SLL) with a stream of labelled data. In contrast, we focus on resolving challenges in Unsupervised Lifelong Learning (ULL) with streaming unlabelled data when the data distribution and the unknown class labels evolve over time. The Bayesian framework is a natural way to incorporate past knowledge and sequentially update beliefs with new data. We develop a fully Bayesian inference framework for ULL with a novel end-to-end Deep Bayesian Unsupervised Lifelong Learning (DBULL) algorithm, which can progressively discover new clusters from unlabelled data without forgetting the past, while learning latent representations. To efficiently maintain past knowledge, we develop a novel knowledge preservation mechanism via sufficient statistics of the latent representation for raw data. To detect potential new clusters on the fly, we develop an automatic cluster discovery and redundancy removal strategy in our inference, inspired by nonparametric Bayesian statistics techniques. We demonstrate the effectiveness of our approach using image and text corpora benchmark datasets in both LL and batch settings.