Clustering
Clustering by Optimizing the Average Silhouette Width
Batool, Fatima, Hennig, Christian
In this paper, we propose a unified clustering approach that can estimate number of clusters and produce clustering against this number simultaneously. Average silhouette width (ASW) is a widely used standard cluster quality index. We define a distance based objective function that optimizes ASW for clustering. The proposed algorithm named as OSil, only, needs data observations as an input without any prior knowledge of the number of clusters. This work is about thorough investigation of the proposed methodology, its usefulness and limitations. A vast spectrum of clustering structures were generated, and several well-known clustering methods including partitioning, hierarchical, density based, and spatial methods were consider as the competitor of the proposed methodology. Simulation reveals that OSil algorithm has shown superior perform in terms of clustering quality than all clustering methods included in the study. OSil can find well separated, compact clusters and have shown better performance for the estimation of number of clusters than several methods. Apart from the proposal of the new methodology and it's investigation this papers offer a systematic analysis on the estimation of cluster indices, some of which never appeared together in comparative simulation setup before. The study offers many insightful findings useful for the selection of the clustering methods and indices.
Predicting passenger origin-destination in online taxi-hailing systems
Golshanrad, Pouria, Mahini, Hamid, Bahrak, Behnam
Because of transportation planning, traffic management and dispatch optimization importance, the passenger origin-destination prediction has become one of the most important requirements for intelligent transportation systems management. In this paper, we propose a model to predict the origin and destination of travels which will occur in the next specified time window. In order to extract meaningful travel flows we use K-means clustering in four-dimensional space with maximum cluster size limitation for origin and destination. Because of large number of clusters, we use non-negative matrix factorization to decrease the number of travel clusters. We also use a stacked recurrent neural network model to predict travels count in each cluster. Comparing our results with other existing models show that our proposed model has 5-7% lower mean absolute percentage error (MAPE) for 1-hour time window, and 14% lower MAPE for 30-minute time window.
Deep clustering with concrete k-means
Gao, Boyan, Yang, Yongxin, Gouk, Henry, Hospedales, Timothy M.
ABSTRACT W e address the problem of simultaneously learning a k -means clustering and deep feature representation from unlabelle d data, which is of interest due to the potential of deep k -means to outperform traditional two-step feature extraction and shallow-clustering strategies. W e achieve this by develop ing a gradient-estimator for the non-differentiable k -means objective via the Gumbel-Softmax reparameterisation trick. In contrast to previous attempts at deep clustering, our concr ete k -means model can be optimised with respect to the canonical k -means objective and is easily trained end-to-end without resorting to alternating optimisation. W e demonstrate the efficacy of our method on standard clustering benchmarks. Index T erms-- Deep Clustering, Unsupervised Learning, Gradient Estimator 1. INTRODUCTION Clustering is a fundamental task in unsupervised machine learning, and one with numerous applications.
A Unified Framework for Tuning Hyperparameters in Clustering Problems
Fan, Xinjie, Yue, Yuguang, Sarkar, Purnamrita, Wang, Y. X. Rachel
Selecting hyperparameters for unsupervised learning problems is difficult in general due to the lack of ground truth for validation. However, this issue is prevalent in machine learning, especially in clustering problems with examples including the Lagrange multipliers of penalty terms in semidefinite programming (SDP) relaxations and the bandwidths used for constructing kernel similarity matrices for Spectral Clustering. Despite this, there are not many provable algorithms for tuning these hyperparameters. In this paper, we provide a unified framework with provable guarantees for the above class of problems. We demonstrate our method on two distinct models. First, we show how to tune the hyperparameters in widely used SDP algorithms for community detection in networks. In this case, our method can also be used for model selection. Second, we show the same framework works for choosing the bandwidth for the kernel similarity matrix in Spectral Clustering for subgaussian mixtures under suitable model specification. In a variety of simulation experiments, we show that our framework outperforms other widely used tuning procedures in a broad range of parameter settings.
Mixture-of-Experts Variational Autoencoder for clustering and generating from similarity-based representations
Kopf, Andreas, Fortuin, Vincent, Somnath, Vignesh Ram, Claassen, Manfred
Clustering high-dimensional data, such as images or biological measurements, is a long-standing problem and has been studied extensively. Recently, Deep Clustering gained popularity due to the non-linearity of neural networks, which allows for flexibility in fitting the specific peculiarities of complex data. Here we introduce the Mixture-of-Experts Similarity Variational Autoencoder (MoE-Sim-VAE), a novel generative clustering model. The model can learn multi-modal distributions of high-dimensional data and use these to generate realistic data with high efficacy and efficiency. MoE-Sim-VAE is based on a Variational Autoencoder (VAE), where the decoder consists of a Mixture-of-Experts (MoE) architecture. This specific architecture allows for various modes of the data to be automatically learned by means of the experts. Additionally, we encourage the latent representation of our model to follow a Gaussian mixture distribution and to accurately represent the similarities between the data points. We assess the performance of our model on synthetic data, the MNIST benchmark data set, and a challenging real-world task of defining cell subpopulations from mass cytometry (CyTOF) measurements on hundreds of different datasets. MoE-Sim-VAE exhibits superior clustering performance on all these tasks in comparison to the baselines and we show that the MoE architecture in the decoder reduces the computational cost of sampling specific data modes with high fidelity.
Creating a k-means clustering model BigQuery ML Google Cloud
Enter the following standard SQL query in the Query editor text area. When the query is complete, click the Results tab below the query text area. The results tab shows the columns you queried that are used to train your model: station_name, duration, num_trips, distance_from_city_center. The results should look like the following. Now that you have examined your training data, the next step is to create a k-means model using the data. You can create and train a k-means model using the CREATE MODEL statement with the option model_type kmeans.
Customer Segmentation Using K Means Clustering - WebSystemer.no
I started with loading all the libraries and dependencies. The columns in the dataset are customer id, gender, age, income and spending score. I dropped the id column as that does not seem relevant to the context. Also I plotted the age frequency of customers. Next I made a box plot of spending score and annual income to better visualize the distribution range. The range of spending score is clearly more than the annual income range.
FISHDBC: Flexible, Incremental, Scalable, Hierarchical Density-Based Clustering for Arbitrary Data and Distance
FISHDBC is a flexible, incremental, scalable, and hierarchical density-based clustering algorithm. It is flexible because it empowers users to work on arbitrary data, skipping the feature extraction step that usually transforms raw data in numeric arrays letting users define an arbitrary distance function instead. It is incremental and scalable: it avoids the $\mathcal O(n^2)$ performance of other approaches in non-metric spaces and requires only lightweight computation to update the clustering when few items are added. It is hierarchical: it produces a "flat" clustering which can be expanded to a tree structure, so that users can group and/or divide clusters in sub- or super-clusters when data exploration requires so. It is density-based and approximates HDBSCAN*, an evolution of DBSCAN.
A Notion of Harmonic Clustering in Simplicial Complexes
Ebli, Stefania, Spreemann, Gard
We outline a novel clustering scheme for simplicial complexes that produces clusters of simplices in a way that is sensitive to the homology of the complex. The method is inspired by, and can be seen as a higher-dimensional version of, graph spectral clustering. The algorithm involves only sparse eigenproblems, and is therefore computationally efficient. We believe that it has broad application as a way to extract features from simplicial complexes that often arise in topological data analysis.
The Local Elasticity of Neural Networks
This paper presents a phenomenon in neural networks that we refer to as \textit{local elasticity}. Roughly speaking, a classifier is said to be locally elastic if its prediction at a feature vector $\bx'$ is \textit{not} significantly perturbed, after the classifier is updated via stochastic gradient descent at a (labeled) feature vector $\bx$ that is \textit{dissimilar} to $\bx'$ in a certain sense. This phenomenon is shown to persist for neural networks with nonlinear activation functions through extensive simulations on real-life and synthetic datasets, whereas this is not observed in linear classifiers. In addition, we offer a geometric interpretation of local elasticity using the neural tangent kernel \citep{jacot2018neural}. Building on top of local elasticity, we obtain pairwise similarity measures between feature vectors, which can be used for clustering in conjunction with $K$-means. The effectiveness of the clustering algorithm on the MNIST and CIFAR-10 datasets in turn corroborates the hypothesis of local elasticity of neural networks on real-life data. Finally, we discuss some implications of local elasticity to shed light on several intriguing aspects of deep neural networks.