Goto

Collaborating Authors

 Clustering


A Nonconvex Splitting Method for Symmetric Nonnegative Matrix Factorization: Convergence Analysis and Optimality

arXiv.org Machine Learning

Symmetric nonnegative matrix factorization (SymNMF) has important applications in data analytics problems such as document clustering, community detection and image segmentation. In this paper, we propose a novel nonconvex variable splitting method for solving SymNMF. The proposed algorithm is guaranteed to converge to the set of Karush-Kuhn-Tucker (KKT) points of the nonconvex SymNMF problem. Furthermore, it achieves a global sublinear convergence rate. We also show that the algorithm can be efficiently implemented in parallel. Further, sufficient conditions are provided which guarantee the global and local optimality of the obtained solutions. Extensive numerical results performed on both synthetic and real data sets suggest that the proposed algorithm converges quickly to a local minimum solution.


On the Existence of Kernel Function for Kernel-Trick of k-Means

arXiv.org Machine Learning

This paper corrects the proof of the Theorem 2 from the Gower's paper \cite[page 5]{Gower:1982}. The correction is needed in order to establish the existence of the kernel function used commonly in the kernel trick e.g. for $k$-means clustering algorithm, on the grounds of distance matrix. The scope of correction is explained in section 2.


Clustering for Different Scales of Measurement - the Gap-Ratio Weighted K-means Algorithm

arXiv.org Machine Learning

This paper describes a method for clustering data that are spread out over large regions and which dimensions are on different scales of measurement. Such an algorithm was developed to implement a robotics application consisting in sorting and storing objects in an unsupervised way. The toy dataset used to validate such application consists of Lego bricks of different shapes and colors. The uncontrolled lighting conditions together with the use of RGB color features, respectively involve data with a large spread and different levels of measurement between data dimensions. To overcome the combination of these two characteristics in the data, we have developed a new weighted K-means algorithm, called gap-ratio K-means, which consists in weighting each dimension of the feature space before running the K-means algorithm. The weight associated with a feature is proportional to the ratio of the biggest gap between two consecutive data points, and the average of all the other gaps. This method is compared with two other variants of K-means on the Lego bricks clustering problem as well as two other common classification datasets.


Independence clustering (without a matrix)

arXiv.org Machine Learning

The independence clustering problem is considered in the following formulation: given a set $S$ of random variables, it is required to find the finest partitioning $\{U_1,\dots,U_k\}$ of $S$ into clusters such that the clusters $U_1,\dots,U_k$ are mutually independent. Since mutual independence is the target, pairwise similarity measurements are of no use, and thus traditional clustering algorithms are inapplicable. The distribution of the random variables in $S$ is, in general, unknown, but a sample is available. Thus, the problem is cast in terms of time series. Two forms of sampling are considered: i.i.d.\ and stationary time series, with the main emphasis being on the latter, more general, case. A consistent, computationally tractable algorithm for each of the settings is proposed, and a number of open directions for further research are outlined.


Clustering Similar Stories Using LDA -- Flipboard Engineering

#artificialintelligence

There is more to a story than meets the eye, and some stories deserve to be presented from more than just one perspective. With Flipboard 4.0, we have released story roundups, a new feature that adds coverage from multiple sources to a story and provides you with a fuller picture of an event. With our scale of millions of articles and constant stream of documents, it's impossible to generate these roundups manually. So, we have developed a clustering algorithm that's both fast and scalable, and in this blog post, I will explain how we create these roundups on Flipboard. Although there are many sophisticated automatic clustering algorithms, such as K-means or Agglomerative clustering, story clustering is a non-trivial problem.


A Random Finite Set Model for Data Clustering

arXiv.org Machine Learning

The goal of data clustering is to partition data points into groups to minimize a given objective function. While most existing clustering algorithms treat each data point as vector, in many applications each datum is not a vector but a point pattern or a set of points. Moreover, many existing clustering methods require the user to specify the number of clusters, which is not available in advance. This paper proposes a new class of models for data clustering that addresses set-valued data as well as unknown number of clusters, using a Dirichlet Process mixture of Poisson random finite sets. We also develop an efficient Markov Chain Monte Carlo posterior inference technique that can learn the number of clusters and mixture parameters automatically from the data. Numerical studies are presented to demonstrate the salient features of this new model, in particular its capacity to discover extremely unbalanced clusters in data.


Image Compression using K-means Clustering : Colour Quantization

#artificialintelligence

This post is a simple yet illustrative application of K-means clustering technique. Using K-means clustering, we will perform quantization of colours present in the image which will further help in compressing the image. In a coloured image, each pixel is of size 3 bytes (RGB), where each colour can have intensity values from 0 to 255. Following combinatorics, the total number of colours which can be represented are 256*256*256. Practically, we are able to visualize only a few colours in an image.


On Approximation Guarantees for Greedy Low Rank Optimization

arXiv.org Machine Learning

We provide new approximation guarantees for greedy low rank matrix estimation under standard assumptions of restricted strong convexity and smoothness. Our novel analysis also uncovers previously unknown connections between the low rank estimation and combinatorial optimization, so much so that our bounds are reminiscent of corresponding approximation bounds in submodular maximization. Additionally, we also provide statistical recovery guarantees. Finally, we present empirical comparison of greedy estimation with established baselines on two important real-world problems.


Clustering-Based Relational Unsupervised Representation Learning with an Explicit Distributed Representation

arXiv.org Machine Learning

The goal of unsupervised representation learning is to extract a new representation of data, such that solving many different tasks becomes easier. Existing methods typically focus on vectorized data and offer little support for relational data, which additionally describe relationships among instances. In this work we introduce an approach for relational unsupervised representation learning. Viewing a relational dataset as a hypergraph, new features are obtained by clustering vertices and hyperedges. To find a representation suited for many relational learning tasks, a wide range of similarities between relational objects is considered, e.g. feature and structural similarities. We experimentally evaluate the proposed approach and show that models learned on such latent representations perform better, have lower complexity, and outperform the existing approaches on classification tasks.


An expressive dissimilarity measure for relational clustering using neighbourhood trees

arXiv.org Artificial Intelligence

Clustering is an underspecified task: there are no universal criteria for what makes a good clustering. This is especially true for relational data, where similarity can be based on the features of individuals, the relationships between them, or a mix of both. Existing methods for relational clustering have strong and often implicit biases in this respect. In this paper, we introduce a novel similarity measure for relational data. It is the first measure to incorporate a wide variety of types of similarity, including similarity of attributes, similarity of relational context, and proximity in a hypergraph. We experimentally evaluate how using this similarity affects the quality of clustering on very different types of datasets. The experiments demonstrate that (a) using this similarity in standard clustering methods consistently gives good results, whereas other measures work well only on datasets that match their bias; and (b) on most datasets, the novel similarity outperforms even the best among the existing ones.