Goto

Collaborating Authors

 Clustering


A Survey of Data Mining Techniques for Social Media Analysis

arXiv.org Artificial Intelligence

Social network has gained remarkable attention in the last decade. Accessing social network sites such as Twitter, Facebook LinkedIn and Google+ through the internet and the web 2.0 technologies has become more affordable. People are becoming more interested in and relying on social network for information, news and opinion of other users on diverse subject matters. The heavy reliance on social network sites causes them to generate massive data characterised by three computational issues namely; size, noise and dynamism. These issues often make social network data very complex to analyse manually, resulting in the pertinent use of computational means of analysing them. Data mining provides a wide range of techniques for detecting useful knowledge from massive datasets like trends, patterns and rules [44]. Data mining techniques are used for information retrieval, statistical modelling and machine learning. These techniques employ data pre-processing, data analysis, and data interpretation processes in the course of data analysis. This survey discusses different data mining techniques used in mining diverse aspects of the social network over decades going from the historical techniques to the up-to-date models, including our novel technique named TRCM. All the techniques covered in this survey are listed in the Table.1 including the tools employed as well as names of their authors.


Anytime Hierarchical Clustering

arXiv.org Machine Learning

We propose a new anytime hierarchical clustering method that iteratively transforms an arbitrary initial hierarchy on the configuration of measurements along a sequence of trees we prove for a fixed data set must terminate in a chain of nested partitions that satisfies a natural homogeneity requirement. Each recursive step re-edits the tree so as to improve a local measure of cluster homogeneity that is compatible with a number of commonly used (e.g., single, average, complete) linkage functions. As an alternative to the standard batch algorithms, we present numerical evidence to suggest that appropriate adaptations of this method can yield decentralized, scalable algorithms suitable for distributed /parallel computation of clustering hierarchies and online tracking of clustering trees applicable to large, dynamically changing databases and anomaly detection.


Using Analogy to Cluster Hand-Drawn Sketches for Sketch-Based Educational Software

AI Magazine

One of the major challenges to building intelligent educational software is determining what kinds of feedback to give learners. Useful feedback makes use of models of domain-specific knowledge, especially models that are commonly held by potential students. To empirically determine what these models are, student data can be clustered to reveal common misconceptions or common problem-solving strategies. This article describes how analogical retrieval and generalization can be used to cluster automatically analyzed hand-drawn sketches incorporating both spatial and conceptual information. We use this approach to cluster a corpus of hand-drawn student sketches to discover common answers. Common answer clusters can be used for the design of targeted feedback and for assessment.


Sparse K-Means with $\ell_{\infty}/\ell_0$ Penalty for High-Dimensional Data Clustering

arXiv.org Machine Learning

Sparse clustering, which aims to find a proper partition of an extremely high-dimensional data set with redundant noise features, has been attracted more and more interests in recent years. The existing studies commonly solve the problem in a framework of maximizing the weighted feature contributions subject to a $\ell_2/\ell_1$ penalty. Nevertheless, this framework has two serious drawbacks: One is that the solution of the framework unavoidably involves a considerable portion of redundant noise features in many situations, and the other is that the framework neither offers intuitive explanations on why this framework can select relevant features nor leads to any theoretical guarantee for feature selection consistency. In this article, we attempt to overcome those drawbacks through developing a new sparse clustering framework which uses a $\ell_{\infty}/\ell_0$ penalty. First, we introduce new concepts on optimal partitions and noise features for the high-dimensional data clustering problems, based on which the previously known framework can be intuitively explained in principle. Then, we apply the suggested $\ell_{\infty}/\ell_0$ framework to formulate a new sparse k-means model with the $\ell_{\infty}/\ell_0$ penalty ($\ell_0$-k-means for short). We propose an efficient iterative algorithm for solving the $\ell_0$-k-means. To deeply understand the behavior of $\ell_0$-k-means, we prove that the solution yielded by the $\ell_0$-k-means algorithm has feature selection consistency whenever the data matrix is generated from a high-dimensional Gaussian mixture model. Finally, we provide experiments with both synthetic data and the Allen Developing Mouse Brain Atlas data to support that the proposed $\ell_0$-k-means exhibits better noise feature detection capacity over the previously known sparse k-means with the $\ell_2/\ell_1$ penalty ($\ell_1$-k-means for short).


Data generator based on RBF network

arXiv.org Artificial Intelligence

There are plenty of problems where the data available is scarce and expensive. We propose a generator of semi-artificial data with similar properties to the original data which enables development and testing of different data mining algorithms and optimization of their parameters. The generated data allow a large scale experimentation and simulations without danger of overfitting. The proposed generator is based on RBF networks which learn sets of Gaussian kernels. Learned Gaussian kernels can be used in a generative mode to generate the data from the same distributions. To asses quality of the generated data we developed several workflows and used them to evaluate the statistical properties of the generated data, structural similarity, and predictive similarity using supervised and unsupervised learning techniques. To determine usability of the proposed generator we conducted a large scale evaluation using 51 UCI data sets. The results show a considerable similarity between the original and generated data and indicate that the method can be useful in several development and simulation scenarios.


Splitting Methods for Convex Clustering

arXiv.org Machine Learning

Clustering is a fundamental problem in many scientific applications. Standard methods such as $k$-means, Gaussian mixture models, and hierarchical clustering, however, are beset by local minima, which are sometimes drastically suboptimal. Recently introduced convex relaxations of $k$-means and hierarchical clustering shrink cluster centroids toward one another and ensure a unique global minimizer. In this work we present two splitting methods for solving the convex clustering problem. The first is an instance of the alternating direction method of multipliers (ADMM); the second is an instance of the alternating minimization algorithm (AMA). In contrast to previously considered algorithms, our ADMM and AMA formulations provide simple and unified frameworks for solving the convex clustering problem under the previously studied norms and open the door to potentially novel norms. We demonstrate the performance of our algorithm on both simulated and real data examples. While the differences between the two algorithms appear to be minor on the surface, complexity analysis and numerical experiments show AMA to be significantly more efficient.


Neighborhood Selection for Thresholding-based Subspace Clustering

arXiv.org Machine Learning

Subspace clustering refers to the problem of clustering high-dimensional data points into a union of low-dimensional linear subspaces, where the number of subspaces, their dimensions and orientations are all unknown. In this paper, we propose a variation of the recently introduced thresholding-based subspace clustering (TSC) algorithm, which applies spectral clustering to an adjacency matrix constructed from the nearest neighbors of each data point with respect to the spherical distance measure. The new element resides in an individual and data-driven choice of the number of nearest neighbors. Previous performance results for TSC, as well as for other subspace clustering algorithms based on spectral clustering, come in terms of an intermediate performance measure, which does not address the clustering error directly. Our main analytical contribution is a performance analysis of the modified TSC algorithm (as well as the original TSC algorithm) in terms of the clustering error directly.


Learning Transformations for Clustering and Classification

arXiv.org Machine Learning

A low-rank transformation learning framework for subspace clustering and classification is here proposed. Many high-dimensional data, such as face images and motion sequences, approximately lie in a union of low-dimensional subspaces. The corresponding subspace clustering problem has been extensively studied in the literature to partition such high-dimensional data into clusters corresponding to their underlying low-dimensional subspaces. However, low-dimensional intrinsic structures are often violated for real-world observations, as they can be corrupted by errors or deviate from ideal models. We propose to address this by learning a linear transformation on subspaces using matrix rank, via its convex surrogate nuclear norm, as the optimization criteria. The learned linear transformation restores a low-rank structure for data from the same subspace, and, at the same time, forces a a maximally separated structure for data from different subspaces. In this way, we reduce variations within subspaces, and increase separation between subspaces for a more robust subspace clustering. This proposed learned robust subspace clustering framework significantly enhances the performance of existing subspace clustering methods. Basic theoretical results here presented help to further support the underlying framework. To exploit the low-rank structures of the transformed subspaces, we further introduce a fast subspace clustering technique, which efficiently combines robust PCA with sparse modeling. When class labels are present at the training stage, we show this low-rank transformation framework also significantly enhances classification performance. Extensive experiments using public datasets are presented, showing that the proposed approach significantly outperforms state-of-the-art methods for subspace clustering and classification.


Collaborative Filtering with Information-Rich and Information-Sparse Entities

arXiv.org Machine Learning

In this paper, we consider a popular model for collaborative filtering in recommender systems where some users of a website rate some items, such as movies, and the goal is to recover the ratings of some or all of the unrated items of each user. In particular, we consider both the clustering model, where only users (or items) are clustered, and the co-clustering model, where both users and items are clustered, and further, we assume that some users rate many items (information-rich users) and some users rate only a few items (information-sparse users). When users (or items) are clustered, our algorithm can recover the rating matrix with $\omega(MK \log M)$ noisy entries while $MK$ entries are necessary, where $K$ is the number of clusters and $M$ is the number of items. In the case of co-clustering, we prove that $K^2$ entries are necessary for recovering the rating matrix, and our algorithm achieves this lower bound within a logarithmic factor when $K$ is sufficiently large. We compare our algorithms with a well-known algorithms called alternating minimization (AM), and a similarity score-based algorithm known as the popularity-among-friends (PAF) algorithm by applying all three to the MovieLens and Netflix data sets. Our co-clustering algorithm and AM have similar overall error rates when recovering the rating matrix, both of which are lower than the error rate under PAF. But more importantly, the error rate of our co-clustering algorithm is significantly lower than AM and PAF in the scenarios of interest in recommender systems: when recommending a few items to each user or when recommending items to users who only rated a few items (these users are the majority of the total user population). The performance difference increases even more when noise is added to the datasets.


Matching Image Sets via Adaptive Multi Convex Hull

arXiv.org Machine Learning

Traditional nearest points methods use all the samples in an image set to construct a single convex or affine hull model for classification. However, strong artificial features and noisy data may be generated from combinations of training samples when significant intra-class variations and/or noise occur in the image set. Existing multi-model approaches extract local models by clustering each image set individually only once, with fixed clusters used for matching with various image sets. This may not be optimal for discrimination, as undesirable environmental conditions (eg. illumination and pose variations) may result in the two closest clusters representing different characteristics of an object (eg. frontal face being compared to non-frontal face). To address the above problem, we propose a novel approach to enhance nearest points based methods by integrating affine/convex hull classification with an adapted multi-model approach. We first extract multiple local convex hulls from a query image set via maximum margin clustering to diminish the artificial variations and constrain the noise in local convex hulls. We then propose adaptive reference clustering (ARC) to constrain the clustering of each gallery image set by forcing the clusters to have resemblance to the clusters in the query image set. By applying ARC, noisy clusters in the query set can be discarded. Experiments on Honda, MoBo and ETH-80 datasets show that the proposed method outperforms single model approaches and other recent techniques, such as Sparse Approximated Nearest Points, Mutual Subspace Method and Manifold Discriminant Analysis.