Clustering
Incremental Cluster Validity Indices for Hard Partitions: Extensions and Comparative Study
da Silva, Leonardo Enzo Brito, Melton, Niklas M., Wunsch, Donald C. II
V alidation is one of the most important aspects of clustering, but most approaches have been batch methods. Recently, interest has grown in providing incremental alternatives. This paper extends the incremental cluster validity index (iCVI) family to include incremental versions of Calinski-Harabasz (iCH), I index and Pakhira-Bandyopadhyay-Maulik (iI and iPBM), Silhouette (iSIL), Negentropy Increment (iNI), Representative Cross Information Potential (irCIP) and Representative Cross Entropy (irH), and Conn Index (iConn Index). Additionally, the effect of under-and over-partitioning on the behavior of these six iCVIs, the Partition Separation (PS) index, as well as two other recently developed iCVIs (incremental Xie-Beni (iXB) and incremental Davies-Bouldin (iDB)) was examined through a comparative study. Experimental results using fuzzy adaptive resonance theory (ART)-based clustering methods showed that while evidence of most under-partitioning cases could be inferred from the behaviors of all these iCVIs, over-partitioning was found to be a more challenging scenario indicated only by the iConn Index. The expansion of incremental validity indices provides significant novel opportunities for assessing and interpreting the results of unsupervised learning. L. E. Brito da Silva is with the Applied Computational Intelligence Laboratory, Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409 USA, and also with the CAPES Foundation, Ministry of Education of Brazil, Bras ılia, DF 70040-020, Brazil (email: leonardoenzo@ieee.org). N. M. Melton is with the Applied Computational Intelligence Laboratory, Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409 USA (email: niklasmelton@ieee.org). D. C. Wunsch II is with the Applied Computational Intelligence Laboratory, Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409 USA (email: wunsch@ieee.org). I NTRODUCTION Cluster validation [1] is a critical topic in cluster analysis.
Going deep in clustering high-dimensional data: deep mixtures of unigrams for uncovering topics in textual data
Anderlucci, Laura, Viroli, Cinzia
They can be basically defined as a multi-layer stack of algorithms or modules able to gradually learn a huge number of parameters in an architecture composed by multiple nonlinear transformations (LeCun et al., 2015). Typically, and for historical reasons, a structure for deep learning is identified with advanced neural networks: deep Feed Forward, Recurrent, Auto-encoder, Convolution neural networks are very effective and used algorithms of deep learning (Schmidhuber, 2015). They demonstrated to be particularly successful in supervised classification problems arising in several fields such as image and speech recognition, gene expression data, topic classification. When the aim is uncovering unknown classes in a unsupervised classification perspective, important methods of deep learning have been developed along the lines of mixture modeling, because of their ability to decompose a heterogeneous collection of units into a finite number of subgroups with homogeneous structures (Fraley and Raftery, 2002; McLachlan and Peel, 2000). In this direction, van den Oord and Schrauwen (2014) proposed Multilayer Gaussian Mixture Models for modeling natural images; Tang et al. (2012) defined deep mixture of factor analyzers with a greedy layer-wise learning algorithm able to learn each layer at a time. Viroli and McLachlan (2019) developed a general framework for Deep Gaussian mixture models that generalizes and encompasses the previous strategies and several flexible model-based clustering methods such as mixtures of mixture models (Li, 2005), mixtures of Factor Analyzers (McLachlan et al., 2003), mixtures of factor analyzers with common factor loadings (Baek et al., 2010), heteroscedastic factor mixture analysis (Montanari and Viroli, 2010) and mixtures of factor mixture analyzers introduced by Viroli (2010). A general'take-home-message' coming from the existing deep clustering strategies is that deep methods vs shallow ones appear to be very efficient and powerful tools especially for complex high-dimensional data; on the contrary, for simple and small data structures, a deep learning strategy cannot improve performance of simpler and conventional methods or, to better say, it is like to use a'sledgehammer to crack a nut'. The motivating problem behind this work derives from ticket data (i.e.
Context-Based Dynamic Pricing with Online Clustering
Miao, Sentao, Chen, Xi, Chao, Xiuli, Liu, Jiaxi, Zhang, Yidong
We consider a context-based dynamic pricing problem of online products which have low sales. Sales data from Alibaba, a major global online retailer, illustrate the prevalence of low-sale products. For these products, existing single-product dynamic pricing algorithms do not work well due to insufficient data samples. To address this challenge, we propose pricing policies that concurrently perform clustering over products and set individual pricing decisions on the fly. By clustering data and identifying products that have similar demand patterns, we utilize sales data from products within the same cluster to improve demand estimation and allow for better pricing decisions. We evaluate the algorithms using the regret, and the result shows that when product demand functions come from multiple clusters, our algorithms significantly outperform traditional single-product pricing policies. Numerical experiments using a real dataset from Alibaba demonstrate that the proposed policies, compared with several benchmark policies, increase the revenue. The results show that online clustering is an effective approach to tackling dynamic pricing problems associated with low-sale products. Our algorithms were further implemented in a field study at Alibaba with 40 products for 30 consecutive days, and compared to the products which use business-as-usual pricing policy of Alibaba. The results from the field experiment show that the overall revenue increased by 10.14%.
The 5 Clustering Algorithms Data Scientists Need to Know
Clustering is a Machine Learning technique that involves the grouping of data points. Given a set of data points, we can use a clustering algorithm to classify each data point into a specific group. In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Clustering is a method of unsupervised learning and is a common technique for statistical data analysis used in many fields. In Data Science, we can use clustering analysis to gain some valuable insights from our data by seeing what groups the data points fall into when we apply a clustering algorithm.
A Probabilistic framework for Quantum Clustering
Casaña-Eslava, Raúl V., Lisboa, Paulo J. G., Ortega-Martorell, Sandra, Jarman, Ian H., Martín-Guerrero, José D.
Quantum Clustering is a powerful method to detect clusters in data with mixed density. However, it is very sensitive to a length parameter that is inherent to the Schr\"odinger equation. In addition, linking data points into clusters requires local estimates of covariance that are also controlled by length parameters. This raises the question of how to adjust the control parameters of the Schr\"odinger equation for optimal clustering. We propose a probabilistic framework that provides an objective function for the goodness-of-fit to the data, enabling the control parameters to be optimised within a Bayesian framework. This naturally yields probabilities of cluster membership and data partitions with specific numbers of clusters. The proposed framework is tested on real and synthetic data sets, assessing its validity by measuring concordance with known data structure by means of the Jaccard score (JS). This work also proposes an objective way to measure performance in unsupervised learning that correlates very well with JS.
Deep Divergence-Based Approach to Clustering
Kampffmeyer, Michael, Løkse, Sigurd, Bianchi, Filippo M., Livi, Lorenzo, Salberg, Arnt-Børre, Jenssen, Robert
A promising direction in deep learning research consists in learning representations and simultaneously discovering cluster structure in unlabeled data by optimizing a discriminative loss function. As opposed to supervised deep learning, this line of research is in its infancy, and how to design and optimize suitable loss functions to train deep neural networks for clustering is still an open question. Our contribution to this emerging field is a new deep clustering network that leverages the discriminative power of information-theoretic divergence measures, which have been shown to be effective in traditional clustering. We propose a novel loss function that incorporates geometric regularization constraints, thus avoiding degenerate structures of the resulting clustering partition. Experiments on synthetic benchmarks and real datasets show that the proposed network achieves competitive performance with respect to other state-of-the-art methods, scales well to large datasets, and does not require pre-training steps.
Infinite Mixture Prototypes for Few-Shot Learning
Allen, Kelsey R., Shelhamer, Evan, Shin, Hanul, Tenenbaum, Joshua B.
We propose infinite mixture prototypes to adaptively represent both simple and complex data distributions for few-shot learning. Our infinite mixture prototypes represent each class by a set of clusters, unlike existing prototypical methods that represent each class by a single cluster. By inferring the number of clusters, infinite mixture prototypes interpolate between nearest neighbor and prototypical representations, which improves accuracy and robustness in the few-shot regime. We show the importance of adaptive capacity for capturing complex data distributions such as alphabets, with 25% absolute accuracy improvements over prototypical networks, while still maintaining or improving accuracy on the standard Omniglot and mini-ImageNet benchmarks. In clustering labeled and unlabeled data by the same clustering rule, infinite mixture prototypes achieves state-of-the-art semi-supervised accuracy. As a further capability, we show that infinite mixture prototypes can perform purely unsupervised clustering, unlike existing prototypical methods.
Nearest Neighbor Median Shift Clustering for Binary Data
Beck, Gaël, Duong, Tarn, Lebbah, Mustapha, Azzag, Hanane
We describe in this paper the theory and practice behind a new modal clustering method for binary data. Our approach (BinNNMS) is based on the nearest neighbor median shift. The median shift is an extension of the well-known mean shift, which was designed for continuous data, to handle binary data. We demonstrate that BinNNMS can discover accurately the location of clusters in binary data with theoretical and experimental analyses.
A Distributed and Approximated Nearest Neighbors Algorithm for an Efficient Large Scale Mean Shift Clustering
Beck, Gaël, Duong, Tarn, Lebbah, Mustapha, Azzag, Hanane, Cérin, Christophe
In this paper we target the class of modal clustering methods where clusters are defined in terms of the local modes of the probability density function which generates the data. The most well-known modal clustering method is the k-means clustering. Mean Shift clustering is a generalization of the k-means clustering which computes arbitrarily shaped clusters as defined as the basins of attraction to the local modes created by the density gradient ascent paths. Despite its potential, the Mean Shift approach is a computationally expensive method for unsupervised learning. Thus, we introduce two contributions aiming to provide clustering algorithms with a linear time complexity, as opposed to the quadratic time complexity for the exact Mean Shift clustering. Firstly we propose a scalable procedure to approximate the density gradient ascent. Second, our proposed scalable cluster labeling technique is presented. Both propositions are based on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. These two techniques may be used for moderate sized datasets. Furthermore, we show that using our proposed approximations of the density gradient ascent as a pre-processing step in other clustering methods can also improve dedicated classification metrics. For the latter, a distributed implementation, written for the Spark/Scala ecosystem is proposed. For all these considered clustering methods, we present experimental results illustrating their labeling accuracy and their potential to solve concrete problems.
Feature Concatenation Multi-view Subspace Clustering
Zheng, Qinghai, Zhu, Jihua, Li, Zhongyu, Pang, Shanmin, Wang, Jun
The consensus information and complementary information of multi-view data ensure the success of multi-view clustering. Since statistic properties of different views are diverse, even incompatible, few approaches directly implement multi-view clustering based on concatenated features. This paper proposes a novel multi-view subspace clustering approach dubbed Feature Concatenation Multi-view Subspace Clustering (FCMSC), which utilizes the joint view representation of multi-view data so as to leverage both the consensus and complementary information for clustering. Specifically, multi-view data are firstly concatenated into one matrix, which is used to derive a special coefficient matrix enjoying the low-rank property. Then, $l_{2,1}$-norm is integrated into the objective function to deal with the sample-specific and cluster-specific corruptions of multiple views for benefiting the clustering performance. It is noteworthy that the obtained coefficient matrix is not derived by simply applying the Low-Rank Representation (LRR) to the joint view representation. What's more, a novel algorithm based on the Augmented Lagrangian Multiplier (ALM) is designed to optimize the object function. Finally, the spectral clustering algorithm is applied to an adjacency matrix calculated from the coefficient matrix. Comprehensive experiments on six real world datasets illustrate its superiority over several state-of-the-art approaches for multi-view clustering.