Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)
Learning the number of clusters is a key problem in data clustering. We present dip-means, a novel robust incremental method to learn the number of data clusters that may be used as a wrapper around any iterative clustering algorithm of the k-means family. In contrast to many popular methods which make assumptions about the underlying cluster distributions, dip-means only assumes a fundamental cluster property: each cluster admits a unimodal distribution. The proposed algorithm considers each cluster member as a "viewer" and applies a univariate statistical hypothesis test for unimodality (the dip-test) to the distribution of the distances between the viewer and the cluster members. Two important advantages are: i) the unimodality test is applied to univariate distance vectors, and ii) it can be directly applied with kernel-based methods, since only the pairwise distances are involved in the computations.
The last decades have been characterized not only by an explosive growth of data, but also by an increasing appreciation of data as a valuable resource. Its value comes with the ability to extract meaningful patterns that are of economic, societal or scientific relevance. A particular challenge is to identify patterns across time, including patterns that might only become apparent when the temporal dimension is taken into account. Here, we present a novel method that aims to achieve this by detecting dynamic clusters, i.e., structural elements that can be present over prolonged durations. It is based on an adaptive identification of majority overlaps between groups at different time points and allows the accommodation of transient decompositions in otherwise persistent dynamic clusters. As such, our method enables the detection of persistent structural elements with internal dynamics and can be applied to any classifiable data, ranging from social contact networks to arbitrary sets of time-stamped feature vectors. It provides a unique tool to study systems with non-trivial temporal dynamics, with broad applicability to scientific, societal and economic data.
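The idea of linking groups across time points by majority overlap can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the adaptive rule or how transient decompositions are handled, so the mutual-majority criterion below and the function name `majority_matches` are assumptions made purely for illustration.

```python
def majority_matches(groups_t, groups_next):
    """Link each group at time t to a group at time t+1 by mutual majority overlap.

    Assumed rule (hypothetical, not taken from the paper): group g is linked to
    group h when their shared members make up more than half of g AND more than
    half of h. Groups within one time point are assumed to be disjoint sets.
    """
    matches = {}
    for i, g in enumerate(groups_t):
        for j, h in enumerate(groups_next):
            overlap = len(g & h)
            if overlap * 2 > len(g) and overlap * 2 > len(h):
                matches[i] = j
                break  # disjoint groups: at most one h can hold a majority of g
    return matches


# Hypothetical memberships at two consecutive time points.
groups_t = [{"a", "b", "c", "d"}, {"e", "f", "g"}]
groups_next = [{"a", "b", "c"}, {"d", "e", "f", "g"}]
links = majority_matches(groups_t, groups_next)
```

Here member "d" moves between groups, yet both groups keep a majority of their members and are therefore treated as persisting from one time point to the next.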
One key use of k-means clustering is to identify cluster prototypes which can serve as representative points for a dataset. However, a drawback of using k-means cluster centers as representative points is that such points distort the distribution of the underlying data. This can be highly disadvantageous in problems where the representative points are subsequently used to gain insights on the data distribution, as these points do not mimic the distribution of the data. To this end, we propose a new clustering method called "distributional clustering", which ensures cluster centers capture the distribution of the underlying data. We first prove the asymptotic convergence of the proposed cluster centers to the data generating distribution, then present an efficient algorithm for computing these cluster centers in practice. Finally, we demonstrate the effectiveness of distributional clustering on synthetic and real datasets.
Based on the online transaction data of COSCO group's centralized procurement platform, this paper studies clustering methods for time series data. Different similarity measures and different clustering methods with different values of K are analysed, and the clustering method best suited to centralized purchasing data is determined. The list of companies in each cluster is then obtained. A time series motif discovery algorithm is used to model the centroid of each cluster. Using the ARIMA method, we also forecast 12 periods ahead for the centroid of each cluster. This paper constructs a matrix of "Customer Lifecycle Theory - Five Elements of Marketing" and puts forward corresponding marketing suggestions for customers at different life cycle stages.
Now we have the probability that each data point belongs to each cluster. If we need hard cluster assignments, we can just choose for each data point to belong to the cluster with the highest probability. But the nice thing about EM is that we can embrace the fuzziness of the cluster membership. We can look at a data point and consider the fact that while it most likely belongs to Cluster B, it's also quite likely to belong to Cluster D. This also takes into account the fact that there may not be clear cut boundaries between our clusters. These groups consist of overlapping multi-dimensional distributions, so drawing clear cut lines might not always be the best solution.
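A minimal sketch of both views of the EM output follows, assuming the membership probabilities are already available as one row per data point. The function names, the 0.3 threshold, and the example numbers are hypothetical and chosen only for illustration.

```python
def hard_assignments(responsibilities):
    """Pick, for each data point, the index of its most probable cluster."""
    return [max(range(len(row)), key=row.__getitem__) for row in responsibilities]


def ambiguous_points(responsibilities, threshold=0.3):
    """List points whose second most probable cluster is still fairly likely.

    Returns (point index, runner-up cluster index, runner-up probability).
    The threshold is an arbitrary illustrative choice.
    """
    flagged = []
    for i, row in enumerate(responsibilities):
        ranked = sorted(range(len(row)), key=row.__getitem__, reverse=True)
        if len(ranked) > 1 and row[ranked[1]] >= threshold:
            flagged.append((i, ranked[1], row[ranked[1]]))
    return flagged


# Hypothetical membership probabilities over clusters A, B, C, D (indices 0-3).
resp = [
    [0.05, 0.55, 0.05, 0.35],  # most likely B, but D is also plausible
    [0.90, 0.04, 0.03, 0.03],
    [0.10, 0.10, 0.70, 0.10],
]
labels = hard_assignments(resp)
fuzzy = ambiguous_points(resp)
```

The hard labels discard the information that the first point sits between Cluster B and Cluster D, while the second function keeps it visible.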
Bagging and boosting have proved to be the best methods of building multiple classifiers in classification combination problems. In the area of "flat clustering" problems, it is also recognized that multi-clustering methods based on boosting provide clusterings of improved quality. In this paper, we introduce a novel multi-clustering method for "hierarchical clusterings" based on boosting theory, which creates a more stable hierarchical clustering of a dataset. The proposed algorithm includes a boosting iteration in which a bootstrap sample is created by weighted random sampling of elements from the original dataset. A hierarchical clustering algorithm is then applied to the selected subsample to build a dendrogram which describes the hierarchy. Finally, the dissimilarity description matrices of the multiple dendrograms are combined into a consensus matrix, using a hierarchical-clustering-combination approach. Experiments on popular real datasets show that the boosted method provides superior quality solutions compared to standard hierarchical clustering methods.
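The weighted resampling step inside the boosting iteration can be sketched as below. This covers only the sampling. The abstract does not describe how the weights are updated between iterations or how the dendrograms are combined, so those steps are not attempted. The function name and the example weights are hypothetical.

```python
import random


def weighted_bootstrap(items, weights, size, rng):
    """Draw `size` items with replacement, each with probability proportional to its weight."""
    return rng.choices(items, weights=weights, k=size)


# Hypothetical dataset of six element ids with illustrative weights.
items = [0, 1, 2, 3, 4, 5]
weights = [1.0, 1.0, 4.0, 4.0, 0.0, 1.0]  # element 4 has zero weight
rng = random.Random(42)  # fixed seed so the draw is reproducible
sample = weighted_bootstrap(items, weights, size=10, rng=rng)
```

Elements with larger weights appear more often in the subsample, and an element with zero weight is never drawn.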
When our data is relatively clean and low-dimensional, looking at a table of summary statistics or some scatter plots can usually reveal how well clustering would work on the data. Look for things like large 'clumps' of points in scatter plots between features, large variances, large differences between median and mean, and properties of the data between quantiles.
The best-known optimization clustering algorithm is k-means clustering. Unlike hierarchical clustering methods that require processing time proportional to the square or cube of the number of observations, the time required by the k-means algorithm is proportional to the number of observations. This means that k-means clustering can be used on larger data sets. A set of points known as seeds is selected as a first guess of the means of the final clusters. These seeds are typically selected from the sample data.
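A minimal k-means sketch in pure Python shows the seeding step and why each pass is linear in the number of observations. It assumes points are tuples of numbers. The function name, the fixed iteration cap, and the example data are illustrative assumptions, not a reference implementation.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on tuples of numbers, with seeds drawn from the sample data."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # first guess of the cluster means
    assignment = []
    for _ in range(iterations):
        # Assignment step: one pass over the data, so the cost per iteration
        # grows linearly with the number of observations (for fixed k).
        assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, assignment


# Two well-separated hypothetical groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
```

Which cluster receives index 0 or 1 depends on the seeds drawn, so only the grouping of the points is meaningful, not the label values.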
Social media provide a low-cost alternative source for public health surveillance, and health-related classification plays an important role in identifying useful information. We summarize recent classification methods that use social media in public health. These methods rely on the bag-of-words (BOW) model and have difficulty grasping the semantic meaning of texts. Unlike these methods, we present a clustering method based on word embeddings. Word embedding is one of the strongest trends in Natural Language Processing (NLP) at this moment.
In this work, we adopt an unsupervised learning approach, k-means clustering, to analyze arterial traffic flow data over a high-dimensional spatio-temporal feature space. As part of the adaptive traffic control system deployed around the East Liberty area in Pittsburgh, high-resolution traffic occupancy and count data are available at the lane level at virtually any time resolution. The k-means clustering method is used to analyze those data to understand the traffic patterns before and after the closure and reopening of an arterial bridge. The modeling framework also holds great potential for predicting traffic flow and detecting incidents. The main findings are that clustering on high-dimensional spatio-temporal features can effectively distinguish flow patterns before and after road closure and reopening, and between weekends and weekdays.