Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)
Bagging and boosting have proven to be the best methods of building multiple classifiers in classification combination problems. In the area of "flat clustering" problems, it is also recognized that multi-clustering methods based on boosting provide clusterings of improved quality. In this paper, we introduce a novel multi-clustering method for "hierarchical clusterings" based on boosting theory, which creates a more stable hierarchical clustering of a dataset. The proposed algorithm includes a boosting iteration in which a bootstrap of samples is created by weighted random sampling of elements from the original dataset. A hierarchical clustering algorithm is then applied to the selected subsample to build a dendrogram that describes the hierarchy. Finally, the dissimilarity description matrices of the multiple dendrograms are combined into a consensus matrix, using a hierarchical-clustering-combination approach. Experiments on popular real datasets show that the boosted method provides superior-quality solutions compared to standard hierarchical clustering methods.
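The three steps described above (weighted resampling, hierarchical clustering of each subsample, and combination of the dendrogram dissimilarities) can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the weight-update rule or the combination approach, so this sketch assumes fixed sampling weights, single-linkage clustering, and a plain average of cophenetic (merge-height) matrices. All function names are hypothetical.

```python
import random

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_linkage_cophenetic(points):
    """Merge-height matrix of a single-linkage dendrogram.

    For single linkage this equals the minimax path distance, computed
    here with a Floyd-Warshall style closure.
    """
    n = len(points)
    d = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], max(d[i][k], d[k][j]))
    return d

def consensus_dissimilarity(points, weights, rounds=20, seed=0):
    """Average the cophenetic matrices of several weighted resamples.

    Assumption: weights stay fixed across rounds; a boosting scheme would
    update them, but the update rule is not given in the abstract.
    """
    rng = random.Random(seed)
    n = len(points)
    total = [[0.0] * n for _ in range(n)]
    count = [[0] * n for _ in range(n)]
    for _ in range(rounds):
        # Weighted sampling with replacement, then drop duplicate indices.
        drawn = sorted(set(rng.choices(range(n), weights=weights, k=n)))
        coph = single_linkage_cophenetic([points[i] for i in drawn])
        for a, i in enumerate(drawn):
            for b, j in enumerate(drawn):
                total[i][j] += coph[a][b]
                count[i][j] += 1
    # Pairs that never appeared together in a subsample are left as None.
    return [[total[i][j] / count[i][j] if count[i][j] else None
             for j in range(n)] for i in range(n)]

# Hypothetical toy data: two well-separated clumps of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
consensus = consensus_dissimilarity(points, weights=[1.0] * len(points))
```

The consensus matrix can then be passed to any hierarchical clustering routine that accepts a precomputed dissimilarity matrix to obtain a single combined dendrogram.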
When our data is relatively clean and low-dimensional, looking at a table of summary statistics or some scatter plots can usually reveal how well clustering would work on the data. Look for things like large 'clumps' of points in scatter plots between features, large variances, large differences between median and mean, properties of the data between quantiles, etc.
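As a small illustration of this kind of check, the sketch below computes the mean, median, variance, and quartiles of one feature using only the Python standard library. The feature values are made up for the example, and `summarize` is a hypothetical helper name.

```python
import statistics

def summarize(column):
    """Return basic summary statistics for one numeric feature."""
    q1, _, q3 = statistics.quantiles(column, n=4)
    return {
        "mean": statistics.mean(column),
        "median": statistics.median(column),
        "variance": statistics.pvariance(column),
        "q1": q1,
        "q3": q3,
    }

# Hypothetical feature with two clumps, one near 1 and one near 10.
feature = [0.9, 1.0, 1.1, 1.2, 9.8, 10.0, 10.1]
stats = summarize(feature)

# A large gap between mean and median hints at clumps or skew.
gap = abs(stats["mean"] - stats["median"])
```

Here the mean sits far from the median because the values fall into two separated groups, which is one of the signals mentioned above. It is only a hint: a skewed feature with no clusters at all can produce the same gap.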
The best-known optimization clustering algorithm is k-means clustering. Unlike hierarchical clustering methods that require processing time proportional to the square or cube of the number of observations, the time required by the k-means algorithm is proportional to the number of observations. This means that k-means clustering can be used on larger data sets. A set of points known as seeds is selected as a first guess of the means of the final clusters. These seeds are typically selected from the sample data.
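A minimal sketch of this procedure (Lloyd's iteration, the standard k-means heuristic) is shown below. It assumes Euclidean distance and seeds drawn at random from the sample data. The function names and the toy data are illustrative only.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (labels, means)."""
    rng = random.Random(seed)
    # Seeds: k distinct observations drawn from the sample data.
    means = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest current mean.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means are too
        labels = new_labels
        # Update step: each mean becomes the centroid of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, means

# Hypothetical toy data: two well-separated clumps of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, means = kmeans(points, k=2)
```

Each pass over the data costs time proportional to the number of observations (times k and the number of features), which is the source of the linear scaling noted above. The number of passes needed for convergence depends on the data and on the seeds.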
Social media provide a low-cost alternative source for public health surveillance, and health-related classification plays an important role in identifying useful information. We summarized the recent classification methods that use social media in public health. These methods rely on the bag-of-words (BOW) model and have difficulty grasping the semantic meaning of texts. Unlike these methods, we present a word-embedding-based clustering method. Word embedding is one of the strongest trends in Natural Language Processing (NLP) at this moment.
In this work, we adopt an unsupervised learning approach, k-means clustering, to analyze arterial traffic flow data over a high-dimensional spatio-temporal feature space. As part of the adaptive traffic control system deployed around the East Liberty area in Pittsburgh, high-resolution traffic occupancy and count data are available at the lane level at virtually any time resolution. The k-means clustering method is used to analyze those data to understand the traffic patterns before and after the closure and reopening of an arterial bridge. The modeling framework also holds great potential for predicting traffic flow and detecting incidents. The main findings are that clustering on high-dimensional spatio-temporal features can effectively distinguish flow patterns before and after road closure and reopening, and between weekends and weekdays.
Cluster analysis methods are used to identify homogeneous subgroups in a data set. In biomedical applications, one frequently applies cluster analysis in order to identify biologically interesting subgroups. In particular, one may wish to identify subgroups that are associated with a particular outcome of interest. Conventional clustering methods generally do not identify such subgroups, particularly when there are a large number of high-variance features in the data set. Conventional methods may identify clusters associated with these high-variance features when one wishes to obtain secondary clusters that are more interesting biologically or more strongly associated with a particular outcome of interest. A modification of sparse clustering can be used to identify such secondary clusters or clusters associated with an outcome of interest. This method correctly identifies such clusters of interest in several simulation scenarios. The method is also applied to a large prospective cohort study of temporomandibular disorders and a leukemia microarray data set.
This paper characterizes hierarchical clustering methods that abide by two previously introduced axioms -- thus, denominated admissible methods -- and proposes tractable algorithms for their implementation. We leverage the fact that, for asymmetric networks, every admissible method must be contained between reciprocal and nonreciprocal clustering, and describe three families of intermediate methods. Grafting methods exchange branches between dendrograms generated by different admissible methods. The convex combination family combines admissible methods through a convex operation in the space of dendrograms, and thirdly, the semi-reciprocal family clusters nodes that are related by strong cyclic influences in the network. Algorithms for the computation of hierarchical clusters generated by reciprocal and nonreciprocal clustering as well as the grafting, convex combination, and semi-reciprocal families are derived using matrix operations in a dioid algebra. Finally, the introduced clustering methods and algorithms are exemplified through their application to a network describing the interrelation between sectors of the United States (U.S.) economy.
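To make the dioid-algebra computation concrete, here is a minimal sketch under stated assumptions. It takes the dioid to be the (min, max) algebra, where "addition" is min and "multiplication" is max. It assumes the usual definitions: reciprocal clustering is the (n-1)-th min-max power of the symmetrized matrix max(A, Aᵀ), and nonreciprocal clustering is the elementwise maximum of the (n-1)-th power of A and its transpose. The function names and the example network are hypothetical.

```python
def minmax_product(A, B):
    """Matrix product in the (min, max) dioid."""
    n = len(A)
    return [[min(max(A[i][k], B[k][j]) for k in range(n))
             for j in range(n)] for i in range(n)]

def minmax_power(A, p):
    """p-th dioid power of A (p >= 1)."""
    result = A
    for _ in range(p - 1):
        result = minmax_product(result, A)
    return result

def reciprocal(A):
    """Ultrametric from chains whose links are cheap in both directions."""
    n = len(A)
    sym = [[max(A[i][j], A[j][i]) for j in range(n)] for i in range(n)]
    return minmax_power(sym, n - 1)

def nonreciprocal(A):
    """Ultrametric from possibly different directed chains each way."""
    n = len(A)
    P = minmax_power(A, n - 1)
    return [[max(P[i][j], P[j][i]) for j in range(n)] for i in range(n)]

# Hypothetical asymmetric network: a cheap directed cycle 0 -> 1 -> 2 -> 0
# (cost 1) with expensive reverse links (cost 5) and a zero diagonal.
A = [[0, 1, 5],
     [5, 0, 1],
     [1, 5, 0]]
u_r = reciprocal(A)      # nodes merge at resolution 5
u_nr = nonreciprocal(A)  # nodes merge at resolution 1
```

In this example the nonreciprocal ultrametric is smaller than the reciprocal one in every off-diagonal entry, consistent with the statement that admissible methods lie between the two.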
When you go to the grocery store, you see that items of a similar nature are displayed near each other. When you organize the clothes in your closet, you put similar items together. Every personal organizing tip on the web to save you from your clutter suggests some sort of grouping of similar items together. Even if we don't notice it, we are involved in grouping similar objects together in every aspect of our life. This is called clustering in machine learning, so in this post I will provide an overview of data mining clustering methods.
The following publication was presented at the 2014 IEEE International Workshop on Cyber Crime and received the Best Paper Award on 5/18/2014. The original IEEE LaTeX formatted PDF publication can also be downloaded from here: IWCC Combined Clustering. To be successful, cybercriminals must figure out how to scale their scams. They duplicate content on new websites, often staying one step ahead of defenders that shut down past schemes. For some scams, such as phishing and counterfeit-goods shops, the duplicated content remains nearly identical. In others, such as advanced-fee fraud and online Ponzi schemes, the criminal must alter content so that it appears different in order to evade detection by victims and law enforcement. Nevertheless, similarities often remain, in terms of the website structure or content, since making truly unique copies does not scale well. In this paper, we present a novel combined clustering method that links together replicated scam websites, even when the criminal has taken steps to hide connections. We evaluate its performance against two collected datasets of scam websites: fake-escrow services and high-yield investment programs (HYIPs). We find that our method groups similar websites together more accurately than do existing general-purpose consensus clustering methods.
Previously, we proposed a physically inspired rule to organize the data points in a sparse yet effective structure, called the in-tree (IT) graph, which is able to capture a wide class of underlying cluster structures in datasets, especially density-based ones. Although some redundant edges between clusters still need to be removed computationally, the IT graph has a big advantage over the k-nearest-neighbor (k-NN) graph or the minimal spanning tree (MST): its redundant edges are much more distinguishable and thus can be easily identified by several methods we proposed previously. In this paper, we propose a general framework to reconstruct the IT graph, based on an initial neighborhood graph, such as the k-NN graph or the MST, and the corresponding graph distances. Our previous way of constructing the IT graph turns out to be a special case of this general framework. The framework 1) can make the IT graph capture a wider class of underlying cluster structures in datasets, especially manifolds, and 2) should be more effective at clustering sparse or graph-based datasets.