Clustering
A Short Survey on Data Clustering Algorithms
With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains; for instance, bioinformatics, speech recognition, and financial analysis. Formally speaking, given a set of data instances, a clustering algorithm is expected to divide the set of data instances into the subsets which maximize the intra-subset similarity and inter-subset dissimilarity, where a similarity measure is defined beforehand. In this work, the state-of-the-arts clustering algorithms are reviewed from design concept to methodology; Different clustering paradigms are discussed. Advanced clustering algorithms are also discussed. After that, the existing clustering evaluation metrics are reviewed. A summary with future insights is provided at the end.
Maximum Likelihood Estimation for Single Linkage Hierarchical Clustering
Zhu, Dekang, Guralnik, Dan P., Wang, Xuezhi, Li, Xiang, Moran, Bill
Clustering is a common technique for statistical data analysis, widely used in data mining, machine learning, pattern recognition, image analysis, bioinformatics and cyber security. Conventional ("flat", "hard") clustering methods accept a finite metric space (O, d) as input and return a partition of O as their output. Hierarchical clustering (HC) methods have a different philosophy: their output is an entire hierarchy of partitions, called a dendrogram, capable of exhibiting multi-scale structure in the data set [1, 2]. Rather than fixing the required number of clusters in advance, as is common for many flat clustering algorithms, it is more informative to furnish a hierarchy of clusters, providing an opportunity to choose a partition at a scale most natural for the context of the task at hand. Many HC methods require linkage functions to provide a measure of dissimilarity between clusters (see [3, 4] for a fairly recent review). Some commonly used linkage functions are single linkage, complete linkage, average linkage, etc. The SLHC method, though suffering from the so called "chaining effect", remains popular for large scale applications [5] because of the low complexity of implementing it using minimum spanning trees (MST) [6].
Tree-Guided MCMC Inference for Normalized Random Measure Mixture Models
Normalized random measures (NRMs) provide a broad class of discrete random measures that are often used as priors for Bayesian nonparametric models. Dirichlet process is a well-known example of NRMs. Most of posterior inference methods for NRM mixture models rely on MCMC methods since they are easy to implement and their convergence is well studied. However, MCMC often suffers from slow convergence when the acceptance rate is low. Tree-based inference is an alternative deterministic posterior inference method, where Bayesian hierarchical clustering (BHC) or incremental Bayesian hierarchical clustering (IBHC) have been developed for DP or NRM mixture (NRMM) models, respectively. Although IBHC is a promising method for posterior inference for NRMM models due to its efficiency and applicability to online inference, its convergence is not guaranteed since it uses heuristics that simply selects the best solution after multiple trials are made. In this paper, we present a hybrid inference algorithm for NRMM models, which combines the merits of both MCMC and IBHC. Trees built by IBHC outlines partitions of data, which guides Metropolis-Hastings procedure to employ appropriate proposals. Inheriting the nature of MCMC, our tree-guided MCMC (tgMCMC) is guaranteed to converge, and enjoys the fast convergence thanks to the effective proposals guided by trees. Experiments on both synthetic and real-world datasets demonstrate the benefit of our method.
Fast clustering for scalable statistical analysis on structured images
Thirion, Bertrand, Hoyos-Idrobo, Andrรฉs, Kahn, Jonas, Varoquaux, Gael
The use of brain images as markers for diseases or behavioral differences is challenged by the small effects size and the ensuing lack of power, an issue that has incited researchers to rely more systematically on large cohorts. Coupled with resolution increases, this leads to very large datasets. A striking example in the case of brain imaging is that of the Human Connectome Project: 20 Terabytes of data and growing. The resulting data deluge poses severe challenges regarding the tractability of some processing steps (discriminant analysis, multivariate models) due to the memory demands posed by these data. In this work, we revisit dimension reduction approaches, such as random projections, with the aim of replacing costly function evaluations by cheaper ones while decreasing the memory requirements. Specifically, we investigate the use of alternate schemes, based on fast clustering, that are well suited for signals exhibiting a strong spatial structure, such as anatomical and functional brain images. Our contribution is twofold: i) we propose a linear-time clustering scheme that bypasses the percolation issues inherent in these algorithms and thus provides compressions nearly as good as traditional quadratic-complexity variance-minimizing clustering schemes, ii) we show that cluster-based compression can have the virtuous effect of removing high-frequency noise, actually improving subsequent estimations steps. As a consequence, the proposed approach yields very accurate models on several large-scale problems yet with impressive gains in computational efficiency, making it possible to analyze large datasets.
Co-Clustering Network-Constrained Trajectory Data
Mahrsi, Mohamed Khalil El, Guigourรจs, Romain, Rossi, Fabrice, Boullรฉ, Marc
Recently, clustering moving object trajectories kept gaining interest from both the data mining and machine learning communities. This problem, however, was studied mainly and extensively in the setting where moving objects can move freely on the euclidean space. In this paper, we study the problem of clustering trajectories of vehicles whose movement is restricted by the underlying road network. We model relations between these trajectories and road segments as a bipartite graph and we try to cluster its vertices. We demonstrate our approaches on synthetic data and show how it could be useful in inferring knowledge about the flow dynamics and the behavior of the drivers using the road network.
Comparing Clustering Approaches for Modeling Players' Values through Avatar Construction
Lim, Chong-U (Massachusetts Institute of Technology) | Harrell, D. Fox (Massachusetts Institute of Technology)
Videogame avatars provide an expressive avenue for players to represent themselves virtually. Research has shown that these avatars, while virtual, can reveal aspects of players' identities, along with physical, social, and cultural values of the real-world. In this paper, we present an approach for modeling player values through their avatars using artificial intelligence (AI) clustering techniques. In a study with 191 participants who created avatars using our system, we provide a thorough comparison of the techniques across numerical, textual, and visual data. Our findings showed that these data structures can effectively reveal players' values and preferences, such as conforming to stereotypes of character roles using statistical attributes, modeling nuances in text descriptions of avatars, and identifying "best-example" (prototypical) avatar appearances that players can be quantitatively shown to conform to. Our findings suggest that AI clustering approaches can be used to model players to yield insight into implicitly held values in a data-driven manner through virtual avatars.
Large-Scale Cross-Game Player Behavior Analysis on Steam
Sifa, Rafet (Fraunhofer IAIS) | Drachen, Anders (Aalborg University) | Bauckhage, Christian (Fraunhofer IAIS)
Behavioral game analytics has predominantly been confined to work on single games, which means that the cross-game applicability of current knowledge remains largely unknown. Here four experiments are presented focusing on the relationship between game ownership, time invested in playing games, and the players themselves, across more than 3000 games distributed by the Steam platform and over 6 million players, covering a total playtime of over 5 billion hours. Experiments are targeted at uncovering high-level patterns in the behavior of players focusing on playtime, using frequent itemset mining on game ownership, cluster analysis to develop playtime-dependent player profiles, correlation between user game rankings and, review scores, playtime and game ownership, as well as cluster analysis on Steam games. Within the context of playtime, the analyses presented provide unique insights into the behavior of game players as they occur across games, for example in how players distribute their time across games.
Bayesian Clustering of Player Styles for Multiplayer Games
Normoyle, Aline (University of Pennsylvania) | Jensen, Shane T. (The Wharton School, University of Pennsylvania)
Clustering is an essential game analysis tool for understanding There are many clustering procedures that could be used player strengths and preferences. For example, clustering to group players based upon their play styles, with k-means techniques have been used to identify player preferences clustering being the most common method. Our use of for using vehicles over direct combat (Drachen et al. 2012), a model-based semi-parametric Bayesian clustering procedure for taking time to solve puzzles over running through content has two important advantages. First, the number of (Drachen, Canossa, and Yannakakis 2009), for understanding clusters (unique player styles) does not have to be prespecified.
From "In" to "Over": Behavioral Experiments on Whole-Network Computation
Dworkin, Lili (University of Pennsylvania) | Kearns, Michael (University of Pennsylvania)
We report on a series of behavioral experiments in human computation on three different tasks over networks: graph coloring, community detection (or graph clustering), and competitive contagion. While these tasks share similar action spaces and interfaces, they capture a diversity of computational challenges: graph coloring is a search problem, clustering is an optimization problem, and competitive contagion is a game-theoretic problem. In contrast with most of the prior literature on human-subject experiments in networks, in which collectives of subjects are embedded "in" the network, and have only local information and interactions, here individual subjects have a global (or "over") view and must solve "whole network" problems alone. Our primary findings are that subject performance is impressive across all three problem types; that subjects find diverse and novel strategies for solving each task; and that collective performance can often be strongly correlated with known algorithms.
Clustering With Side Information: From a Probabilistic Model to a Deterministic Algorithm
Khashabi, Daniel, Wieting, John, Liu, Jeffrey Yufei, Liang, Feng
In this paper, we propose a model-based clustering method (TVClust) that robustly incorporates noisy side information as soft-constraints and aims to seek a consensus between side information and the observed data. Our method is based on a nonparametric Bayesian hierarchical model that combines the probabilistic model for the data instance and the one for the side-information. An efficient Gibbs sampling algorithm is proposed for posterior inference. Using the small-variance asymptotics of our probabilistic model, we then derive a new deterministic clustering algorithm (RDP-means). It can be viewed as an extension of K-means that allows for the inclusion of side information and has the additional property that the number of clusters does not need to be specified a priori. Empirical studies have been carried out to compare our work with many constrained clustering algorithms from the literature on both a variety of data sets and under a variety of conditions such as using noisy side information and erroneous k values. The results of our experiments show strong results for our probabilistic and deterministic approaches under these conditions when compared to other algorithms in the literature.