Clustering
Modal-set estimation with an application to clustering
Jiang, Heinrich, Kpotufe, Samory
We present a first procedure that can estimate -- with statistical consistency guarantees -- any local-maxima of a density, under benign distributional conditions. The procedure estimates all such local maxima, or $\textit{modal-sets}$, of any bounded shape or dimension, including usual point-modes. In practice, modal-sets can arise as dense low-dimensional structures in noisy data, and more generally serve to better model the rich variety of locally-high-density structures in data. The procedure is then shown to be competitive on clustering applications, and moreover is quite stable to a wide range of settings of its tuning parameter.
Nonparametric Modeling of Dynamic Functional Connectivity in fMRI Data
Nielsen, Søren F. V., Madsen, Kristoffer H., Røge, Rasmus, Schmidt, Mikkel N., Mørup, Morten
Dynamic functional connectivity (FC) has in recent years become a topic of interest in the neuroimaging community. Several models and methods exist for both functional magnetic resonance imaging (fMRI) and electroencephalography (EEG), and the results point towards the conclusion that FC exhibits dynamic changes. The existing approaches modeling dynamic connectivity have primarily been based on time-windowing the data and k-means clustering. We propose a non-parametric generative model for dynamic FC in fMRI that does not rely on specifying window lengths and number of dynamic states. Rooted in Bayesian statistical modeling we use the predictive likelihood to investigate if the model can discriminate between a motor task and rest both within and across subjects. We further investigate what drives dynamic states using the model on the entire data collated across subjects and task/rest. We find that the number of states extracted are driven by subject variability and preprocessing differences while the individual states are almost purely defined by either task or rest. This questions how we in general interpret dynamic FC and points to the need for more research on what drives dynamic FC.
Clustering with a Reject Option: Interactive Clustering as Bayesian Prior Elicitation
Srivastava, Akash, Zou, James, Sutton, Charles
A good clustering can help a data analyst to explore and understand a data set, but what constitutes a good clustering may depend on domain-specific and application-specific criteria. These criteria can be difficult to formalize, even when it is easy for an analyst to know a good clustering when she sees one. We present a new approach to interactive clustering for data exploration, called \ciif, based on a particularly simple feedback mechanism, in which an analyst can choose to reject individual clusters and request new ones. The new clusters should be different from previously rejected clusters while still fitting the data well. We formalize this interaction in a novel Bayesian prior elicitation framework. In each iteration, the prior is adapted to account for all the previous feedback, and a new clustering is then produced from the posterior distribution. To achieve the computational efficiency necessary for an interactive setting, we propose an incremental optimization method over data minibatches using Lagrangian relaxation. Experiments demonstrate that \ciif can produce accurate and diverse clusterings.
Phase Transitions for High Dimensional Clustering and Related Problems
Jin, Jiashun, Ke, Zheng Tracy, Wang, Wanjie
Consider a two-class clustering problem where we observe $X_i = \ell_i \mu + Z_i$, $Z_i \stackrel{iid}{\sim} N(0, I_p)$, $1 \leq i \leq n$. The feature vector $\mu\in R^p$ is unknown but is presumably sparse. The class labels $\ell_i\in\{-1, 1\}$ are also unknown and the main interest is to estimate them. We are interested in the statistical limits. In the two-dimensional phase space calibrating the rarity and strengths of useful features, we find the precise demarcation for the Region of Impossibility and Region of Possibility. In the former, useful features are too rare/weak for successful clustering. In the latter, useful features are strong enough to allow successful clustering. The results are extended to the case of colored noise using Le Cam's idea on comparison of experiments. We also extend the study on statistical limits for clustering to that for signal recovery and that for hypothesis testing. We compare the statistical limits for three problems and expose some interesting insight. We propose classical PCA and Important Features PCA (IF-PCA) for clustering. For a threshold $t > 0$, IF-PCA clusters by applying classical PCA to all columns of $X$ with an $L^2$-norm larger than $t$. We also propose two aggregation methods. For any parameter in the Region of Possibility, some of these methods yield successful clustering. We find an interesting phase transition for IF-PCA. Our results require delicate analysis, especially on post-selection Random Matrix Theory and on lower bound arguments.
Understanding data mining clustering methods
When you go to the grocery store, you see that items of a similar nature are displayed nearby to each other. When you organize the clothes in your closet, you put similar items together (e.g. Every personal organizing tip on the web to save you from your clutter suggests some sort of grouping of similar items together. Even we don't notice it, we are involved in grouping similar objects together in every aspect of our life. This is called clustering in machine learning, so in this post I will provide an overview of data mining clustering methods. In machine learning or data mining, clustering assigns similar objects together in order to discover structures in data that doesn't have any labels.
Robust Ensemble Clustering Using Probability Trajectories
Huang, Dong, Lai, Jian-Huang, Wang, Chang-Dong
Note that V Y L Link set of G w ij W eight between two nodes in G G K -elite neighbor graph (K -ENG) V Node set of G . Note that V Y L Link set of G w ij W eight between two nodes in G p ij (1-step) transition probability fromy i to y j P (1-step) transition probability matrix,P { p ij } N N p T ij T -step transition probability fromy i to y j P T T -step transition probability matrix,P T { p T ij } N N p T i: The i -th row ofP T, p T i: { p T i 1,···,p T i N} PT T i Probability trajectory of a random walker starting fromnode y i with lengthT PTS ij Probability trajectory based similarity betweeny i and y j R (0) Set of the initial regions for PTA,R (0) { R (0) 1,···,R (0) R (0) } S (0) Initial similarity matrix for PTA,S (0) { s (0) ij } R (0) R (0) R ( t) Set of thet -step regions for PTA, R ( t) { R ( t) 1,···,R ( t) R ( t) } S ( t) The t -step similarity matrix for PTA,S ( t) { s ( t) ij } R ( t) R ( t) G Microcluster-cluster bipartite graph (MCBG) N Number of nodes in G V Node set of G L Link set of G w ij W eight between two nodes in G A sparse graph termedK -elite neighbor graph (K -ENG) is then constructed with only a small number of probably reliable links. The ENS strategy is a crucial step in our approach. W e argue that using a small number of probably reliable links may lead to significantly better consensus results than using all graph links regardless of their reliability . The random walk process driven by a new transition probability matrix is performed on theK -ENG to explore the global structure information.
Combining Multiple Clusterings via Crowd Agreement Estimation and Multi-Granularity Link Analysis
Huang, Dong, Lai, Jian-Huang, Wang, Chang-Dong
The clustering ensemble technique aims to combine multiple clusterings into a probably better and more robust clustering and has been receiving an increasing attention in recent years. There are mainly two aspects of limitations in the existing clustering ensemble approaches. Firstly, many approaches lack the ability to weight the base clusterings without access to the original data and can be affected significantly by the low-quality, or even ill clusterings. Secondly, they generally focus on the instance level or cluster level in the ensemble system and fail to integrate multi-granularity cues into a unified model. To address these two limitations, this paper proposes to solve the clustering ensemble problem via crowd agreement estimation and multigranularity link analysis. We present the normalized crowd agreement index (NCAI) to evaluate the quality of base clusterings in an unsupervised manner and thus weight the base clusterings in accordance with their clustering validity. To explore the relationship between clusters, the source aware connected triple (SACT) similarity is introduced with regard to their common neighbors and the source reliability. Present address: School of Information Science and Technology, Sun Yat-sen University, Guangzhou Higher Education Mega Center, Panyu District, Guangzhou, Guangdong, 510006, P. R. China. The experiments are conducted on eight real-world datasets. The experimental results demonstrate the effectiveness and robustness of the proposed methods. Keywords: Clustering ensemble, Clustering aggregation, Weighted evidence accumulation clustering, Graph partitioning with multi-granularity link analysis 1. Introduction Data clustering is a fundamental and very challenging problem in data mining and machine learning.
Automatic Identification of Replicated Criminal Websites Using Combined Clustering Methods
The following publication was presented at the 2014 IEEE International Workshop on Cyber Crime and received the Best Paper Award on 5/18/2014. The original IEEE LaTeX formatted PDF publication can also be downloaded from here: IWCC Combined Clustering. To be successful, cybercriminals must figure out how to scale their scams. They duplicate content on new websites, often staying one step ahead of defenders that shut down past schemes. For some scams, such as phishing and counterfeitgoods shops, the duplicated content remains nearly identical. In others, such as advanced-fee fraud and online Ponzi schemes, the criminal must alter content so that it appears different in order to evade detection by victims and law enforcement. Nevertheless, similarities often remain, in terms of the website structure or content, since making truly unique copies does not scale well. In this paper, we present a novel combined clustering method that links together replicated scam websites, even when the criminal has taken steps to hide connections. We evaluate its performance against two collected datasets of scam websites: fake-escrow services and high-yield investment programs (HYIPs). We find that our method more accurately groups similar websites together than does existing general-purpose consensus clustering methods.
Short Communication on QUIST: A Quick Clustering Algorithm
Then, it starts splitting C into smaller sub-clusters, until either one of the following conditions holds, whichever happens first: 1. The instances within each sub-cluster, c, are too similar to be divided any further, or 2. The number of clusters hits an optional upper bound provided by the user. If such bound is not provided by the user, it is assume to be equal to the number of input instances, C Additionally, splitting a given cluster stops when it reaches a minimum cluster size per the user's choice. To decide whether or not instances within a cluster, c, are similar enough to stop splitting it, QUIST calculates the "spreadness" metric denoted by Ψ, such that the spreadness of c, denoted by Ψ