Clustering
An Online Hierarchical Algorithm for Extreme Clustering
Kobren, Ari, Monath, Nicholas, Krishnamurthy, Akshay, McCallum, Andrew
Clustering algorithms are a crucial component of any data scientist's toolbox, with applications ranging from identifying themes in large text corpora [10], to finding functionally similar genes [17], to visualization, pre-processing, and dimensionality reduction [21]. As such, a number of clustering algorithms have been developed and studied by the statistics, machine learning, and theoretical computer science communities. These algorithms and analyses target a variety of scenarios, including large-scale, online, or streaming settings [38, 1], clustering with distribution shift [2], and many more. Modern clustering applications require algorithms that scale gracefully with dataset size and complexity. In clustering, dataset size is measured by the number of points N and their dimensionality d, while the number of clusters, K, serves as a measure of complexity. While several existing algorithms can cope with large datasets, very few adequately handle datasets with many clusters. We call problem instances with large N and large K extreme clustering problems, a phrase inspired by work in extreme classification [12]. Extreme clustering problems are increasingly prevalent. For example, in entity resolution, record linkage and deduplication, the number of clusters (i.e., entities) increases with dataset size [6] and can be in the
The first two authors contributed equally.
Color quantization using k-means
The idea is to introduce a few concepts that are needed to understand what comes next without going into too much detail, since a fuller explanation is outside the scope of this post. Feel free to skip these parts if you already know what they're talking about. As mentioned earlier, a color can be represented as a point in an n-dimensional space called a color space. Most commonly the space is 3-dimensional, and the coordinates in that space are used to encode a color. There are many color spaces for different purposes and with different gamuts (ranges of colors), and in each of them it is possible to define a distance metric that quantifies the difference between two colors. The most common and easiest distance metric is the Euclidean distance, which is used in the RGB and Lab color spaces. The RGB (short for red-green-blue) color space is by far the most widely used color space. The idea is that it is possible to create colors by combining red, green and blue. A color in RGB is usually encoded as a 3-tuple of 8-bit values, hence each dimension takes a value in the range [0, 255], where 0 stands for absence of that color and 255 stands for its full presence.
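As a minimal sketch of the distance metric described above, the snippet below computes the Euclidean distance between two RGB colors and uses it to pick the closest entry from a small palette. The function names `rgb_distance` and `nearest_color` and the sample palette are illustrative assumptions, not taken from the original post.

```python
import math

def rgb_distance(c1, c2):
    """Euclidean distance between two RGB colors given as (r, g, b) tuples in [0, 255]."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def nearest_color(color, palette):
    """Return the palette entry closest to `color` under the Euclidean RGB distance."""
    return min(palette, key=lambda candidate: rgb_distance(color, candidate))

# Hypothetical palette used only for illustration.
black = (0, 0, 0)
white = (255, 255, 255)
red = (255, 0, 0)
palette = [black, white, red]

# The two extreme corners of the RGB cube are 255 * sqrt(3) apart.
print(rgb_distance(black, white))
# A slightly off-red pixel maps back to pure red.
print(nearest_color((250, 10, 10), palette))
```

Assigning every pixel to its nearest palette color in this way is the assignment step that a k-means based quantizer repeats, with the palette playing the role of the cluster centers.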
A Brain-like Cognitive Process with Shared Methods
This paper describes a new entropy-style equation that may be useful in a general sense, but can be applied to a cognitive model with related processes. The model is based on the human brain, with automatic and distributed pattern activity. Methods for carrying out the different processes are suggested. The main purpose of this paper is to reaffirm earlier research on different knowledge-based and experience-based clustering techniques. The overall architecture has stayed essentially the same, so it is the localised processes or smaller details that have been updated. For example, a counting mechanism is used slightly differently, to measure a level of 'cohesion' instead of a 'correct' classification, over pattern instances. The introduction of features has further enhanced the architecture and the new entropy-style equation is proposed. While an earlier paper defined three levels of functional requirement, this paper re-defines the levels in a more human vernacular, with higher-level goals described in terms of action-result pairs.
How Machines Make Sense of Big Data: an Introduction to Clustering Algorithms
While there's not necessarily a "correct" answer here, it's most likely you split the bugs into four clusters. That wasn't too bad, was it? You could probably do the same with twice as many bugs, right? If you had a bit of time to spare -- or a passion for entomology -- you could probably even do the same with a hundred bugs. For a machine though, grouping ten objects into however many meaningful clusters is no small task, thanks to a mind-bending branch of maths called combinatorics, which tells us that there are 115,975 different possible ways you could have grouped those ten insects together. Had there been twenty bugs, there would have been over fifty trillion possible ways of clustering them. With a hundred bugs -- there'd be many times more solutions than there are particles in the known universe. In fact, there are more than four million billion googol solutions (what's a googol?).
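The counts quoted above are Bell numbers: the number of ways to split n distinct items into non-empty groups. A minimal sketch that reproduces them with the Bell triangle recurrence follows; the function name `bell_number` is an illustrative choice, not something from the article.

```python
def bell_number(n):
    """Number of ways to partition n labelled items into non-empty groups.

    Uses the Bell triangle: each row starts with the last entry of the
    previous row, and every further entry is the sum of its left neighbour
    and the entry above that neighbour.
    """
    row = [1]
    for _ in range(n):
        next_row = [row[-1]]
        for value in row:
            next_row.append(next_row[-1] + value)
        row = next_row
    return row[0]

print(bell_number(10))  # 115975
print(bell_number(20))  # 51724158235372, i.e. over fifty trillion
# A googol is 10**100, so "four million billion googol" is 4 * 10**115.
print(bell_number(100) > 4 * 10**6 * 10**9 * 10**100)  # True
```

Python integers have arbitrary precision, so the hundred-bug case is computed exactly rather than approximated.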
Community detection and stochastic block models: recent developments
The stochastic block model (SBM) is a random graph model with planted clusters. It is widely employed as a canonical model to study clustering and community detection, and generally provides fertile ground for studying the statistical and computational tradeoffs that arise in network and data sciences. This note surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational thresholds, and for various recovery requirements such as exact, partial and weak recovery (a.k.a. detection). The main results discussed are the phase transition for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial recovery, the learning of the SBM parameters, and the gap between information-theoretic and computational thresholds. The note also covers some of the algorithms developed in the quest to achieve these limits, in particular two-round algorithms via graph-splitting, semi-definite programming, linearized belief propagation, and classical and nonbacktracking spectral methods. A few open problems are also discussed.
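To make "random graph model with planted clusters" concrete, here is a minimal sketch of sampling from a simple SBM in which every pair of nodes in the same community is connected with probability `p_in` and every pair in different communities with probability `p_out`. This two-parameter form and the names `sample_sbm`, `sizes`, `p_in` and `p_out` are assumptions made for illustration; the survey treats more general connectivity matrices.

```python
import random

def sample_sbm(sizes, p_in, p_out, seed=None):
    """Sample an undirected graph from a two-parameter stochastic block model.

    sizes : list of community sizes; node labels are assigned block by block.
    p_in  : edge probability for two nodes in the same community.
    p_out : edge probability for two nodes in different communities.
    Returns (labels, edges) where edges is a set of (i, j) pairs with i < j.
    """
    rng = random.Random(seed)
    labels = [k for k, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                edges.add((i, j))
    return labels, edges

# Two planted communities of 50 nodes each, denser inside than across.
labels, edges = sample_sbm([50, 50], p_in=0.3, p_out=0.02, seed=0)
within = sum(1 for i, j in edges if labels[i] == labels[j])
print(len(edges), within)
```

Community detection is then the inverse task: given only `edges`, recover `labels` exactly, partially, or better than chance, depending on the recovery requirement.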
Improving Spectral Clustering using the Asymptotic Value of the Normalised Cut
Spectral clustering is a popular and versatile clustering method based on a relaxation of the normalised graph cut objective. Despite its popularity, however, there is no single agreed upon method for tuning the important scaling parameter, nor for determining automatically the number of clusters to extract. Popular heuristics exist, but corresponding theoretical results are scarce. In this paper we investigate the asymptotic value of the normalised cut for an increasing sample assumed to arise from an underlying probability distribution, and based on this result provide recommendations for improving spectral clustering methodology. A corresponding algorithm is proposed with strong empirical performance.
Hybrid Clustering based on Content and Connection Structure using Joint Nonnegative Matrix Factorization
Du, Rundong, Drake, Barry, Park, Haesun
We present a hybrid method for latent information discovery on data sets containing both text content and connection structure, based on constrained low rank approximation. The new method jointly optimizes the Nonnegative Matrix Factorization (NMF) objective function for text clustering and the Symmetric NMF (SymNMF) objective function for graph clustering. We propose an effective algorithm for the joint NMF objective function, based on a block coordinate descent (BCD) framework. The proposed hybrid method discovers content associations via latent connections found using SymNMF. The method can also be applied with a natural conversion of the problem when a hypergraph formulation is used or the content is associated with hypergraph edges. Experimental results show that by simultaneously utilizing both content and connection structure, our hybrid method produces higher quality clustering results compared to the other NMF clustering methods that use content alone (standard NMF) or connection structure alone (SymNMF). We also present some interesting applications to several types of real world data, such as citation recommendations of papers. The hybrid method proposed in this paper can also be applied to general data expressed with both feature space vectors and pairwise similarities, and can be extended to the case with multiple feature spaces or multiple similarity measures.
Algebraic Variety Models for High-Rank Matrix Completion
Ongie, Greg, Willett, Rebecca, Nowak, Robert D., Balzano, Laura
We consider a generalization of low-rank matrix completion to the case where the data belongs to an algebraic variety, i.e. each data point is a solution to a system of polynomial equations. In this case the original matrix is possibly high-rank, but it becomes low-rank after mapping each column to a higher dimensional space of monomial features. Many well-studied extensions of linear models, including affine subspaces and their union, can be described by a variety model. In addition, varieties can be used to model a richer class of nonlinear quadratic and higher degree curves and surfaces. We study the sampling requirements for matrix completion under a variety model with a focus on a union of affine subspaces. We also propose an efficient matrix completion algorithm that minimizes a convex or non-convex surrogate of the rank of the matrix of monomial features. Our algorithm uses the well-known "kernel trick" to avoid working directly with the high-dimensional monomial matrix. We show the proposed algorithm is able to recover synthetically generated data up to the predicted sampling complexity bounds. The proposed algorithm also outperforms standard low rank matrix completion and subspace clustering techniques in experiments with real data.
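The claim that a full-rank data matrix can become rank-deficient after a monomial feature map can be checked on a toy variety. The sketch below is an illustration under stated assumptions, not the authors' algorithm: it takes points on the unit circle x^2 + y^2 = 1, maps each to the degree-2 monomial features (1, x, y, x^2, xy, y^2), and compares exact ranks using rational arithmetic. The helper names `rank`, `circle_point` and `monomials_deg2` are hypothetical.

```python
from fractions import Fraction

def rank(rows):
    """Exact matrix rank via Gaussian elimination over the rationals."""
    m = [list(r) for r in rows]
    found = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(found, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        for r in range(found + 1, len(m)):
            factor = m[r][col] / m[found][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[found])]
        found += 1
    return found

def circle_point(t):
    """Rational point on the unit circle from the standard parametrisation."""
    t = Fraction(t)
    return ((1 - t * t) / (1 + t * t), 2 * t / (1 + t * t))

def monomials_deg2(x, y):
    """All monomials in x and y of degree at most 2."""
    return [Fraction(1), x, y, x * x, x * y, y * y]

points = [circle_point(t) for t in range(1, 11)]  # 10 distinct points
data_rank = rank(points)
feature_rank = rank([monomials_deg2(x, y) for x, y in points])

print(data_rank)     # 2: the raw 2-dimensional data is full rank
print(feature_rank)  # 5: one short of 6, because 1 - x^2 - y^2 = 0 on every point
```

Each polynomial equation satisfied by all the data shows up as a linear dependency among the monomial features, which is the low-rank structure a completion method can exploit.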
Asymmetric Learning Vector Quantization for Efficient Nearest Neighbor Classification in Dynamic Time Warping Spaces
Jain, Brijnesh, Schultz, David
The nearest neighbor (NN) classifier endowed with the dynamic time warping (DTW) distance is one of the most popular methods in time series classification [9, 44]. Application examples include electrocardiogram frame classification [16], gesture recognition [2, 32], speech recognition [24], and voice recognition [23]. Two disadvantages of the naive NN method are its high storage and computation requirements. Storage requirements are high because the entire training set must be retained in order to apply the classification rule. Computation requirements are high because classifying a test example requires computing the DTW distance between the test example and every training example.
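A minimal sketch of the naive NN-DTW classifier described above makes both costs visible: the whole training set is kept, and every query is compared against every stored series. The local cost used here (squared difference, with a square root at the end) is one common convention and is an assumption, as are the names `dtw_distance` and `nn_classify` and the toy data.

```python
def dtw_distance(x, y):
    """Dynamic time warping distance between two numeric sequences.

    Fills an (n+1) x (m+1) table where cost[i][j] is the cheapest cumulative
    cost of aligning the first i elements of x with the first j elements of y.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = local + min(cost[i - 1][j],      # x[i-1] repeats a match
                                     cost[i][j - 1],      # y[j-1] repeats a match
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m] ** 0.5

def nn_classify(query, training_set):
    """Label of the training series closest to `query` under DTW.

    training_set is a list of (series, label) pairs; all of them are scanned.
    """
    best_series, best_label = min(
        training_set, key=lambda pair: dtw_distance(query, pair[0])
    )
    return best_label

# Toy training set, for illustration only.
training_set = [
    ([0, 0, 1, 2, 1, 0], "bump"),
    ([2, 2, 1, 0, 1, 2], "dip"),
]
# A time-stretched bump still matches the "bump" prototype exactly.
print(nn_classify([0, 1, 2, 2, 1, 0, 0], training_set))
```

Each call to `dtw_distance` costs time proportional to the product of the two series lengths, so classification time grows linearly with the number of stored training examples.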