Clustering
Is there a way to adaptively guess k (the number of clusters) during online k-means?
The moment you say K-means, it indicates you have knowledge of number of clusters (i.e. K) in advance and you are not going to change it later on. I believe, what you intend to ask is, is there an automatic way to adaptively changing the number of clusters as new data arrives. Normally, in online clustering you start with one sample (hence one cluster) and based on *some* criteria, you either merge or break clusters to adaptively change number of clusters; this process is not online k-means clustering but only online clustering. In online K-means clustering, you update the cluster center information for every sample and do not wait for all the samples to arrive (or else it becomes traidtional offline k-means clustering).
A Semi-Definite Programming approach to low dimensional embedding for unsupervised clustering
Chrรฉtien, Stรฉphane, Dombry, Clรฉment, Faivre, Adrien
This paper proposes a variant of the method of Gu\'edon and Verhynin for estimating the cluster matrix in the Mixture of Gaussians framework via Semi-Definite Programming. A clustering oriented embedding is deduced from this estimate. The procedure is suitable for very high dimensional data because it is based on pairwise distances only. Theoretical garantees are provided and an eigenvalue optimisation approach is proposed for computing the embedding. The performance of the method is illustrated via Monte Carlo experiements and comparisons with other embeddings from the literature.
A Variational Approximations-DIC Rubric for Parameter Estimation and Mixture Model Selection Within a Family Setting
Subedi, Sanjeena, McNicholas, Paul D.
Mixture model-based clustering has become an increasingly popular data analysis technique since its introduction fifty years ago, and is now commonly utilized within the family setting. Families of mixture models arise when the component parameters, usually the component covariance matrices, are decomposed and a number of constraints are imposed. Within the family setting, we need to choose the member of the family, i.e., the appropriate covariance structure, in addition to the number of mixture components. To date, the Bayesian information criterion (BIC) has proved most effective for model selection, and the expectation-maximization (EM) algorithm is usually used for parameter estimation. To date, this EM-BIC rubric has monopolized the literature on families of mixture models. We deviate from this rubric, using variational Bayes approximations for parameter estimation and the deviance information criterion for model selection. The variational Bayes approach alleviates some of the computational complexities associated with the EM algorithm by constructing a tight lower bound on the complex marginal likelihood and maximizing this lower bound by minimizing the associated Kullback-Leibler divergence. We use this approach on the most famous family of Gaussian mixture models within the literature and real and simulated data are used to compare our approach to the EM-BIC rubric.
The Dependent Random Measures with Independent Increments in Mixture Models
Luo, Cheng, Da Xu, Richard Yi, Xiang, Yang
When observations are organized into groups where commonalties exist amongst them, the dependent random measures can be an ideal choice for modeling. One of the propositions of the dependent random measures is that the atoms of the posterior distribution are shared amongst groups, and hence groups can borrow information from each other. When normalized dependent random measures prior with independent increments are applied, we can derive appropriate exchangeable probability partition function (EPPF), and subsequently also deduce its inference algorithm given any mixture model likelihood. We provide all necessary derivation and solution to this framework. For demonstration, we used mixture of Gaussians likelihood in combination with a dependent structure constructed by linear combinations of CRMs. Our experiments show superior performance when using this framework, where the inferred values including the mixing weights and the number of clusters both respond appropriately to the number of completely random measure used.
Grouping the executables to detect malware with high accuracy
Sahay, Sanjay K., Sharma, Ashu
The metamorphic malware variants with the same malicious behavior (family), can obfuscate themselves to look different from each other. This variation in structure leads to a huge signature database for traditional signature matching techniques to detect them. In order to effective and efficient detection of malware in large amounts of executables, we need to partition these files into groups which can identify their respective families. In addition, the grouping criteria should be chosen such a way that, it can also be applied to unknown files encounter on computers for classification. This paper discusses the study of malware and benign executables in groups to detect unknown malware with high accuracy. We studied sizes of malware generated by three popular second generation malware (metamorphic malware) creator kits viz. G2, PS-MPC and NGVCK, and observed that the size variation in any two generated malware from same kit is not much. Hence, we grouped the executables on the basis of malware sizes by using Optimal k-Means Clustering algorithm and used these obtained groups to select promising features for training (Random forest, J48, LMT, FT and NBT) classifiers to detect variants of malware or unknown malware. We find that detection of malware on the basis of their respected file sizes gives accuracy up to 99.11% from the classifiers.
A Fuzzy Clustering Algorithm for the Mode Seeking Framework
The analysis of large and possibly high-dimensional datasets is becoming ubiquitous in the sciences. The long-term objective is to gain insight into the structure of measurement or simulation data, for a better understanding of the underlying physical phenomena at work. Clustering is one of the simplest ways of gaining such insight, by finding a suitable decomposition of the data into clusters such that data points within a same cluster share common (and, if possible, exclusive) properties. In this work, we are interested in the mode seeking approach to clustering. This approach assumes the data points to be drawn from some unknown probability distribution and defines the clusters as the basins of attraction of the maxima of the density, requiring a preliminary density estimation phase [7, 5, 10, 11, 13, 15]. The theoretical analysis of this clustering framework has drawn increasing attention recently, see 1 [6, 3, 9, 8, 2]. However, this (hard) clustering method provides a fairly limited knowledge on the structure of the data: while the partition into clusters is well understood, the interplay between clusters (respective locations, proximity relations, interactions) remains unknown. Identifying interfaces between clusters is the first step towards a higher-level understanding of the data, and it already plays a prominent role in some applications such as the study of the conformations space of a protein, where a fundamental question beyond the detection of metastable states is to understand when and how the protein can switch from one metastable state to another [12]. Hard clustering can be used in this context, for instance by defining the border between two clusters as the set of data points whose neighborhood (in the ambient space or in some neighborhood graph) intersects the two clusters, however this kind of information is by nature unstable with respect to perturbations of the data.
A Survey of Signed Network Mining in Social Media
Tang, Jiliang, Chang, Yi, Aggarwal, Charu, Liu, Huan
Many real-world relations can be represented by signed networks with positive and negative links, as a result of which signed network analysis has attracted increasing attention from multiple disciplines. With the increasing prevalence of social media networks, signed network analysis has evolved from developing and measuring theories to mining tasks. In this article, we present a review of mining signed networks in the context of social media and discuss some promising research directions and new frontiers. We begin by giving basic concepts and unique properties and principles of signed networks. Then we classify and review tasks of signed network mining with representative algorithms. We also delineate some tasks that have not been extensively studied with formal definitions and also propose research directions to expand the field of signed network mining.
Clustering with a Reject Option: Interactive Clustering as Bayesian Prior Elicitation
Srivastava, Akash, Zou, James, Adams, Ryan P., Sutton, Charles
A good clustering can help a data analyst to explore and understand a data set, but what constitutes a good clustering may depend on domain-specific and application-specific criteria. These criteria can be difficult to formalize, even when it is easy for an analyst to know a good clustering when they see one. We present a new approach to interactive clustering for data exploration called TINDER, based on a particularly simple feedback mechanism, in which an analyst can reject a given clustering and request a new one, which is chosen to be different from the previous clustering while fitting the data well. We formalize this interaction in a Bayesian framework as a method for prior elicitation, in which each different clustering is produced by a prior distribution that is modified to discourage previously rejected clusterings. We show that TINDER successfully produces a diverse set of clusterings, each of equivalent quality, that are much more diverse than would be obtained by randomized restarts.
Ground Truth Bias in External Cluster Validity Indices
Lei, Yang, Bezdek, James C., Romano, Simone, Vinh, Nguyen Xuan, Chan, Jeffrey, Bailey, James
It has been noticed that some external CVIs exhibit a preferential bias towards a larger or smaller number of clusters which is monotonic (directly or inversely) in the number of clusters in candidate partitions. This type of bias is caused by the functional form of the CVI model. For example, the popular Rand index (RI) exhibits a monotone increasing (NCinc) bias, while the Jaccard Index (JI) index suffers from a monotone decreasing (NCdec) bias. This type of bias has been previously recognized in the literature. In this work, we identify a new type of bias arising from the distribution of the ground truth (reference) partition against which candidate partitions are compared. We call this new type of bias ground truth (GT) bias. This type of bias occurs if a change in the reference partition causes a change in the bias status (e.g., NCinc, NCdec) of a CVI. For example, NCinc bias in the RI can be changed to NCdec bias by skewing the distribution of clusters in the ground truth partition. It is important for users to be aware of this new type of biased behaviour, since it may affect the interpretations of CVI results. The objective of this article is to study the empirical and theoretical implications of GT bias. To the best of our knowledge, this is the first extensive study of such a property for external cluster validity indices.
Achieving Exact Cluster Recovery Threshold via Semidefinite Programming: Extensions
Hajek, Bruce, Wu, Yihong, Xu, Jiaming
Resolving a conjecture of Abbe, Bandeira and Hall, the authors have recently shown that the semidefinite programming (SDP) relaxation of the maximum likelihood estimator achieves the sharp threshold for exactly recovering the community structure under the binary stochastic block model of two equal-sized clusters. The same was shown for the case of a single cluster and outliers. Extending the proof techniques, in this paper it is shown that SDP relaxations also achieve the sharp recovery threshold in the following cases: (1) Binary stochastic block model with two clusters of sizes proportional to network size but not necessarily equal; (2) Stochastic block model with a fixed number of equal-sized clusters; (3) Binary censored block model with the background graph being Erd\H{o}s-R\'enyi. Furthermore, a sufficient condition is given for an SDP procedure to achieve exact recovery for the general case of a fixed number of clusters plus outliers. These results demonstrate the versatility of SDP relaxation as a simple, general purpose, computationally feasible methodology for community detection.