 Clustering


A Domain Adaptive Density Clustering Algorithm for Data with Varying Density Distribution

arXiv.org Machine Learning

Abstract: As one type of efficient unsupervised learning method, clustering algorithms have been widely used in data mining and knowledge discovery with noticeable advantages. However, clustering algorithms based on density peaks have limited clustering effect on data with varying density distribution (VDD), equilibrium distribution (ED), and multiple domain-density maximums (MDDM), leading to the problems of sparse cluster loss and cluster fragmentation. To address these problems, we propose a Domain-Adaptive Density Clustering (DADC) algorithm, which consists of three steps: domain-adaptive density measurement, cluster center self-identification, and cluster self-ensemble. For data with VDD features, clusters in sparse regions are often neglected when uniform density peak thresholds are used, which results in the loss of sparse clusters. We treat each data point and its KNN neighborhood as a subgroup to better reflect its density distribution in a domain view. In addition, for data with ED or MDDM features, a large number of density peaks with similar values can be identified, which results in cluster fragmentation. We propose a cluster center self-identification and cluster self-ensemble method to automatically extract the initial cluster centers and merge the fragmented clusters. Experimental results demonstrate that, compared with other algorithms, the proposed DADC algorithm obtains more reasonable clustering results on data with VDD, ED and MDDM features. Benefiting from few parameter requirements and a non-iterative nature, DADC achieves low computational complexity and is suitable for large-scale data clustering. Numerous clustering algorithms have been proposed, including the partitioning-based, hierarchical-based, density-based, grid-based, model-based, and density-peak-based methods [3-6].
Among them, density-based methods (e.g., DBSCAN, CLIQUE, and OPTICS) can effectively discover clusters of arbitrary shape using the density connectivity of clusters, and do not require a predefined number of clusters [6]. In recent years, Density-Peak-based Clustering (DPC) algorithms, as a branch of density-based clustering, were introduced in [7, 8], assuming that the cluster centers are surrounded by low-density neighbors and can be detected by efficiently searching for local density peaks. Benefiting from few parameter requirements and a non-iterative nature, DPC algorithms can efficiently detect clusters of arbitrary shape from large-scale datasets with low computational complexity. However, as shown in Figure 1, DPC algorithms have limited clustering effect on data with varying density distribution (VDD), multiple domain-density maximums (MDDM), or equilibrium distribution (ED).
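The density-peak idea described above can be illustrated with a minimal sketch. It assumes the common cutoff-kernel formulation of density-peak clustering (a local density rho counted within a cutoff distance, and a distance delta to the nearest point of higher density). It is not the domain-adaptive density measure of DADC, and the function name, the cutoff `d_c`, and the toy points are hypothetical.

```python
import math

def density_peak_scores(points, d_c):
    """Return (rho, delta) per point for a basic density-peak analysis.

    rho[i]   : number of other points closer than the cutoff d_c (assumed kernel)
    delta[i] : distance to the nearest point with strictly higher rho
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # By convention, a point with no denser neighbor gets its largest distance.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical toy data: one dense group and one sparse group.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
rho, delta = density_peak_scores(points, d_c=0.5)
print(rho)
print([round(d, 2) for d in delta])
```

With a single cutoff, every point in the sparse group receives a density of zero, which is the kind of situation the abstract describes for data with VDD features: a uniform density threshold would overlook the sparse cluster.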


Clustering Metrics Better Than the Elbow Method - KDnuggets

#artificialintelligence

Clustering is an important part of the machine learning pipeline for business or scientific enterprises utilizing data science. As the name suggests, it helps to identify congregations of closely related (by some measure of distance) data points in a blob of data that would otherwise be difficult to make sense of. However, clustering mostly falls under the realm of unsupervised machine learning. And unsupervised ML is a messy business. There are no known answers or labels to guide the optimization process or measure our success against.


A Probabilistic Approach for Discovering Daily Human Mobility Patterns with Mobile Data

arXiv.org Machine Learning

Discovering human mobility patterns with geo-location data collected from smartphone users has been a hot research topic in recent years. In this paper, we attempt to discover daily mobility patterns based on GPS data. We view this problem from a probabilistic perspective in order to explore more information from the original GPS data compared to other conventional methods. A nonparametric Bayesian modeling method, the Infinite Gaussian Mixture Model (IGMM), is used to estimate the probability density for the daily mobility. Then, we use Kullback-Leibler divergence as the metric to measure the similarity of different probability distributions. Combining the Infinite Gaussian Mixture Model and Kullback-Leibler divergence, we derive an automatic clustering algorithm to discover mobility patterns for each individual user without setting the number of clusters in advance. In the experiments, the effectiveness of our method is validated on real user data collected from different users. The results show that the IGMM-based algorithm outperforms the GMM-based algorithm. We also test our methods on datasets of different lengths to discover the minimum data length for discovering mobility patterns. Introduction. Smartphone devices are equipped with multiple sensors that can record user behavior on the handsets. With the help of large-scale smartphone usage data, researchers are able to study human behavior in the real world.
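As a rough illustration of comparing distributions with Kullback-Leibler divergence, the sketch below computes it for discrete histograms. The abstract works with continuous densities estimated by an IGMM. The discretization, the function name `kl_divergence`, the `eps` guard, and the example histograms are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D(p || q) after normalizing both histograms.

    eps guards against division by zero when q has an empty bin (an assumption
    of this sketch, not something stated in the abstract).
    """
    total_p, total_q = sum(p), sum(q)
    kl = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / total_p, qi / total_q
        if pi > 0:
            kl += pi * math.log(pi / max(qi, eps))
    return kl

# Hypothetical occupancy histograms over the same two location bins.
day_a = [0.9, 0.1]
day_b = [0.5, 0.5]
print(kl_divergence(day_a, day_b))
print(kl_divergence(day_b, day_a))
```

The two printed values differ because KL divergence is not symmetric, so a similarity measure built on it usually has to fix a direction or combine both.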


Memory-Efficient Episodic Control Reinforcement Learning with Dynamic Online k-means

arXiv.org Machine Learning

Recently, neuro-inspired episodic control (EC) methods have been developed to overcome the data-inefficiency of standard deep reinforcement learning approaches. Using non-/semi-parametric models to estimate the value function, they learn rapidly, retrieving cached values from similar past states. In realistic scenarios, with limited resources and noisy data, maintaining meaningful representations in memory is essential to speed up the learning and avoid catastrophic forgetting. Unfortunately, EC methods have a large space and time complexity. We investigate different solutions to these problems based on prioritising and ranking stored states, as well as online clustering techniques. We also propose a new dynamic online k-means algorithm that is both computationally-efficient and yields significantly better performance at smaller memory sizes; we validate this approach on classic reinforcement learning environments and Atari games.
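For orientation, here is a minimal sketch of the classic sequential (online) k-means update, in which each incoming point moves its nearest centroid by a step of one over that centroid's count. This is the textbook baseline and not the dynamic online k-means variant proposed in the abstract. The class name and the example stream are hypothetical.

```python
import math

class OnlineKMeans:
    """Classic sequential k-means: one centroid update per incoming point."""

    def __init__(self, k):
        self.k = k
        self.centroids = []  # list of coordinate lists
        self.counts = []     # number of points assigned to each centroid

    def update(self, x):
        x = list(x)
        # Seed the first k centroids with the first k points seen.
        if len(self.centroids) < self.k:
            self.centroids.append(x)
            self.counts.append(1)
            return len(self.centroids) - 1
        # Assign to the nearest centroid and move it toward the point.
        j = min(range(self.k), key=lambda i: math.dist(self.centroids[i], x))
        self.counts[j] += 1
        step = 1.0 / self.counts[j]
        self.centroids[j] = [c + step * (xi - c)
                             for c, xi in zip(self.centroids[j], x)]
        return j

model = OnlineKMeans(k=2)
for point in [(0, 0), (10, 10), (2, 2), (12, 12)]:
    model.update(point)
print(model.centroids)
```

Memory stays fixed at k centroids regardless of stream length, which is the property that makes online clustering attractive for a bounded episodic memory.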


Visual Tactile Fusion Object Clustering

arXiv.org Machine Learning

Object clustering, aiming at grouping similar objects into one cluster with an unsupervised strategy, has been extensively studied among various data-driven applications. However, most existing state-of-the-art object clustering methods (e.g., single-view or multi-view clustering methods) only explore visual information, while ignoring one of the most important sensing modalities, i.e., tactile information, which can help capture different object properties and further boost the performance of the object clustering task. To effectively benefit from both visual and tactile modalities for object clustering, in this paper, we propose a deep Auto-Encoder-like Non-negative Matrix Factorization framework for visual-tactile fusion clustering. Specifically, deep matrix factorization constrained by an under-complete Auto-Encoder-like architecture is employed to jointly learn a hierarchical expression of visual-tactile fusion data, and preserve the local structure of the data-generating distribution of visual and tactile modalities. Meanwhile, a graph regularizer is introduced to capture the intrinsic relations of data samples within each modality. Furthermore, we propose a modality-level consensus regularizer to effectively align the visual and tactile data in a common subspace in which the gap between visual and tactile data is mitigated. For the model optimization, we present an efficient alternating minimization strategy to solve our proposed model. Finally, we conduct extensive experiments on public datasets to verify the effectiveness of our framework.


Large-scale Multi-view Subspace Clustering in Linear Time

arXiv.org Machine Learning

A plethora of multi-view subspace clustering (MVSC) methods have been proposed over the past few years. Researchers manage to boost clustering accuracy from different points of view. However, many state-of-the-art MVSC algorithms, which typically have quadratic or even cubic complexity, are inefficient and inherently difficult to apply at large scales. In the era of big data, the computational issue becomes critical. To fill this gap, we propose a large-scale MVSC (LMVSC) algorithm with linear order complexity. Inspired by the idea of the anchor graph, we first learn a smaller graph for each view. Then, a novel approach is designed to integrate those graphs so that we can implement spectral clustering on a smaller graph. Interestingly, it turns out that our model also applies to the single-view scenario.


CNAK : Cluster Number Assisted K-means

arXiv.org Machine Learning

Determining the number of clusters present in a dataset is an important problem in cluster analysis. Conventional clustering techniques generally assume this parameter to be provided up front. In this paper, we propose a method which analyzes cluster stability for predicting the cluster number. Under the same computational framework, the technique also finds representatives of the clusters. The method is apt for handling big data, as we design the algorithm using Monte-Carlo simulation. We also explore a few pertinent issues in clustering. Experiments reveal that the proposed method is capable of identifying a single cluster. It is robust in handling high-dimensional datasets and performs reasonably well on datasets with cluster imbalance. Moreover, it can indicate cluster hierarchy, if present. Overall, we have observed significant improvement in speed and quality for predicting cluster numbers as well as the composition of clusters in a large dataset. Keywords: k-means clustering, Bipartite graph, Perfect Matching, Kuhn-Munkres Algorithm, Monte Carlo simulation. 1. Introduction In cluster analysis, it is required to group a set of data points in a multidimensional space, so that data points in the same group are more similar to each other than to those in other groups. These groups are called clusters. Various distance functions may be used to compute the degree of similarity or dissimilarity among these data points. The Euclidean distance function is widely used in clustering. The aim of this unsupervised technique is to increase homogeneity within a group and heterogeneity between groups. Several clustering methods with different characteristics have been proposed for different purposes. Some well-known methods include partition-based clustering [26], hierarchical clustering [25], spectral clustering [27], and density-based clustering [12]. However, they require the knowledge of the cluster number for a given dataset a priori [12, 21, 26, 27, 36].
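The keywords above mention perfect matching and the Kuhn-Munkres algorithm, but the excerpt does not say how they are used. One plausible use, assumed here purely for illustration, is aligning the centroids of two k-means runs so that their agreement can be measured. The sketch below finds the minimum-cost perfect matching by brute force over permutations, a stand-in for Kuhn-Munkres that is feasible only for small k. The function name and the centroid lists are hypothetical.

```python
import itertools
import math

def match_centroids(a, b):
    """Return (perm, cost): perm[i] is the index in b matched to a[i].

    Brute-force minimum-cost perfect matching under Euclidean distance;
    Kuhn-Munkres solves the same problem in polynomial time.
    """
    k = len(a)
    best_perm, best_cost = None, math.inf
    for perm in itertools.permutations(range(k)):
        cost = sum(math.dist(a[i], b[perm[i]]) for i in range(k))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Hypothetical centroids from two runs: same clusters, different label order.
run_a = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
run_b = [(10.0, 0.0), (0.0, 0.0), (5.0, 5.0)]
perm, cost = match_centroids(run_a, run_b)
print(perm, cost)
```

A low matching cost across repeated runs would suggest a stable clustering for that k, while a high cost would suggest instability. How the paper itself scores stability is not stated in the excerpt.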


How to Train a Machine Learning Model in JASP: Clustering - JASP - Free and User-Friendly Statistical Software

#artificialintelligence

This is a continuation of our series on machine learning methods that have been implemented in JASP (version 0.11 onwards). In this blog post we train a machine learning model to find clusters within our data set. The goal of a clustering task is to detect structures in the data. To do so, the algorithm needs to (1) identify the number of structures/groups in the data, and (2) figure out how the features are distributed in each group. For instance, clustering can be used to detect subgenres in electronic music, subgroups in a customer database, or to identify areas where there are greater incidences of particular types of crime.


Gromov-Wasserstein Factorization Models for Graph Clustering

arXiv.org Machine Learning

We propose a new nonlinear factorization model for graphs with topological structures and, optionally, node attributes. This model is based on a pseudometric called Gromov-Wasserstein (GW) discrepancy, which compares graphs in a relational way. It estimates observed graphs as GW barycenters constructed by a set of atoms with different weights. By minimizing the GW discrepancy between each observed graph and its GW barycenter-based estimation, we learn the atoms and their weights associated with the observed graphs. The model achieves a novel and flexible factorization mechanism under GW discrepancy, in which both the observed graphs and the learnable atoms can be unaligned and of different sizes. We design an effective approximate algorithm for learning this Gromov-Wasserstein factorization (GWF) model, unrolling loopy computations as stacked modules and computing gradients with backpropagation. The stacked modules can take two different architectures, which correspond to the proximal point algorithm (PPA) and Bregman alternating direction method of multipliers (BADMM), respectively. Experiments show that our model obtains encouraging results on clustering graphs. Introduction As an important methodology for machine learning, factorization models explore intrinsic structures of high-dimensional observations explicitly, and have been widely used in many learning tasks, e.g., data clustering (Ng, Jordan, and Weiss 2002), dimensionality reduction (Candès et al. 2011), recommendation systems (Wang and Blei 2011), etc. In particular, factorization models decompose high-dimensional observations into a set of atoms under specific criteria and achieve their latent representations accordingly.


Deep Unsupervised Clustering with Clustered Generator Model

arXiv.org Machine Learning

However, unsupervised clustering remains one of the most fundamental challenges in machine learning because of the high dimensionality of data and the high complexity of their hidden structures. Long-established approaches for unsupervised clustering, including K-means [15] and the Gaussian Mixture Model (GMM) [3], are still the building blocks for numerous applications due to their efficiency and simplicity. However, their distance metrics are limited to data space, making them ineffective for high-dimensional data such as images. Therefore, considerable efforts have been put into obtaining a good feature embedding of data, usually of low dimensionality, for effective clustering [37]. However, the representation obtained by standalone data embedding typically cannot capture the latent structure and variation of the observed data, which may make it ineffective for clustering. We believe a good representation for clustering should also be able to compactly represent the observed data distribution to encode all necessary characteristics of the observation. Deep generative models (a.k.a. generator models) have shown great promise in learning latent representations for high-dimensional signals such as images and videos [32, 24, 11]. Generator models parameterized by deep neural networks specify a nonlinear mapping from latent variables to observed data. (Tian Han is the corresponding author.)