Goto

Collaborating Authors

 Clustering


Julia Language in Machine Learning: Algorithms, Applications, and Open Issues

arXiv.org Machine Learning

Machine learning is driving development across many fields in science and engineering. A simple and efficient programming language could accelerate applications of machine learning in various fields. Currently, the programming languages most commonly used to develop machine learning algorithms include Python, MATLAB, and C/C ++. However, none of these languages well balance both efficiency and simplicity. The Julia language is a fast, easy-to-use, and open-source programming language that was originally designed for high-performance computing, which can well balance the efficiency and simplicity. This paper summarizes the related research work and developments in the application of the Julia language in machine learning. It first surveys the popular machine learning algorithms that are developed in the Julia language. Then, it investigates applications of the machine learning algorithms implemented with the Julia language. Finally, it discusses the open issues and the potential future directions that arise in the use of the Julia language in machine learning.


Hypergraph Clustering in the Weighted Stochastic Block Model via Convex Relaxation of Truncated MLE

arXiv.org Machine Learning

We study hypergraph clustering under the weighted $d$-uniform hypergraph stochastic block model ($d$-WHSBM), where each edge consisting of $d$ nodes has higher expected weight if $d$ nodes are from the same community compared to edges consisting of nodes from different communities. We propose a new hypergraph clustering algorithm, which is a convex relaxation of truncated maximum likelihood estimator (CRTMLE), that can handle the relatively sparse, high-dimensional regime of the $d$-WHSBM with community sizes of different orders. We provide performance guarantees of this algorithm under a unified framework for different parameter regimes, and show that it achieves the order-wise optimal or the best existing results for approximately balanced community sizes. We also demonstrate the first recovery guarantees for the setting with growing number of communities of unbalanced sizes.


Efficient Clustering for Stretched Mixtures: Landscape and Optimality

arXiv.org Machine Learning

This paper considers a canonical clustering problem where one receives unlabeled samples drawn from a balanced mixture of two elliptical distributions and aims for a classifier to estimate the labels. Many popular methods including PCA and k-means require individual components of the mixture to be somewhat spherical, and perform poorly when they are stretched. To overcome this issue, we propose a non-convex program seeking for an affine transform to turn the data into a one-dimensional point cloud concentrating around -1 and 1, after which clustering becomes easy. Our theoretical contributions are two-fold: (1) we show that the non-convex loss function exhibits desirable landscape properties as long as the sample size exceeds some constant multiple of the dimension, and (2) we leverage this to prove that an efficient first-order algorithm achieves near-optimal statistical precision even without good initialization. We also propose a general methodology for multi-class clustering tasks with flexible choices of feature transforms and loss objectives.


A semi-supervised sparse K-Means algorithm

arXiv.org Machine Learning

We consider the problem of data clustering with unidentified feature quality but the existence of small amount of label data. In the first case a sparse clustering method can be employed in order to detect the subgroup of features necessary for clustering and in the second case a semi-supervised method can use the labelled data to create constraints and enhance the clustering solution. In this paper we propose a K-Means inspired algorithm that employs these techniques. We show that the algorithm maintains the high performance of other similar semi-supervised algorthms as well as keeping the ability to identify informative from uninformative features. We examine the performance of the algorithm on real world data sets with unknown features quality as well as a real world data set with a known uninformative feature. We use a series of scenarios with different number and types of constraints.


Flattening a Hierarchical Clustering through Active Learning

Neural Information Processing Systems

We investigate active learning by pairwise similarity over the leaves of trees originating from hierarchical clustering procedures. In the realizable setting, we provide a full characterization of the number of queries needed to achieve perfect reconstruction of the tree cut. In the non-realizable setting, we rely on known important-sampling procedures to obtain regret and query complexity bounds. Our algorithms come with theoretical guarantees on the statistical error and, more importantly, lend themselves to {\em linear-time} implementations in the relevant parameters of the problem. We discuss such implementations, prove running time guarantees for them, and present preliminary experiments on real-world datasets showing the compelling practical performance of our algorithms as compared to both passive learning and simple active learning baselines.


Random Projections and Sampling Algorithms for Clustering of High-Dimensional Polygonal Curves

Neural Information Processing Systems

We study the $k$-median clustering problem for high-dimensional polygonal curves with finite but unbounded number of vertices. We tackle the computational issue that arises from the high number of dimensions by defining a Johnson-Lindenstrauss projection for polygonal curves. We analyze the resulting error in terms of the Fr\'echet distance, which is a tractable and natural dissimilarity measure for curves. Our clustering algorithms achieve sublinear dependency on the number of input curves via subsampling. Also, we show that the Fr\'echet distance can not be approximated within any factor of less than $\sqrt{2}$ by probabilistically reducing the dependency on the number of vertices of the curves.


k-Means Clustering of Lines for Big Data

Neural Information Processing Systems

The input to the \emph{$k$-mean for lines} problem is a set $L$ of $n$ lines in $\mathbb{R} d$, and the goal is to compute a set of $k$ centers (points) in $\mathbb{R} d$ that minimizes the sum of squared distances over every line in $L$ and its nearest center. This is a straightforward generalization of the $k$-mean problem where the input is a set of $n$ points instead of lines. We suggest the first PTAS that computes a $(1 \epsilon)$-approximation to this problem in time $O(n \log n)$ for any constant approximation error $\epsilon \in (0, 1)$, and constant integers $k, d \geq 1$. This is by proving that there is always a weighted subset (called coreset) of $dk {O(k)}\log (n)/\epsilon 2$ lines in $L$ that approximates the sum of squared distances from $L$ to \emph{any} given set of $k$ points. Using traditional merge-and-reduce technique, this coreset implies results for a streaming set (possibly infinite) of lines to $M$ machines in one pass (e.g.


Subquadratic High-Dimensional Hierarchical Clustering

Neural Information Processing Systems

We consider the widely-used average-linkage, single-linkage, and Ward's methods for computing hierarchical clusterings of high-dimensional Euclidean inputs. It is easy to show that there is no efficient implementation of these algorithms in high dimensional Euclidean space since it implicitly requires to solve the closest pair problem, a notoriously difficult problem. However, how fast can these algorithms be implemented if we allow approximation? More precisely: these algorithms successively merge the clusters that are at closest average (for average-linkage), minimum distance (for single-linkage), or inducing the least sum-of-square error (for Ward's). We ask whether one could obtain a significant running-time improvement if the algorithm can merge $\gamma$-approximate closest clusters (namely, clusters that are at distance (average, minimum, or sum-of-square error) at most $\gamma$ times the distance of the closest clusters).


Coresets for Clustering with Fairness Constraints

Neural Information Processing Systems

In a recent work, \cite{chierichetti2017fair} studied the following fair'' variants of classical clustering problems such as k-means and k-median: given a set of n data points in R d and a binary type associated to each data point, the goal is to cluster the points while ensuring that the proportion of each type in each cluster is roughly the same as its underlying proportion. Subsequent work has focused on either extending this setting to when each data point has multiple, non-disjoint sensitive types such as race and gender \cite{bera2019fair}, or to address the problem that the clustering algorithms in the above work do not scale well. The main contribution of this paper is an approach to clustering with fairness constraints that involve {\em multiple, non-disjoint} attributes, that is {\em also scalable}. Our approach is based on novel constructions of coresets: for the k-median objective, we construct an \eps-coreset of size O(\Gamma k 2 \eps {-d}) where \Gamma is the number of distinct collections of groups that a point may belong to, and for the k-means objective, we show how to construct an \eps-coreset of size O(\Gamma k 3\eps {-d-1}). The former result is the first known coreset construction for the fair clustering problem with the k-median objective, and the latter result removes the dependence on the size of the full dataset as in \cite{schmidt2018fair} and generalizes it to multiple, non-disjoint attributes.


Foundations of Comparison-Based Hierarchical Clustering

Neural Information Processing Systems

We address the classical problem of hierarchical clustering, but in a framework where one does not have access to a representation of the objects or their pairwise similarities. Instead, we assume that only a set of comparisons between objects is available, that is, statements of the form objects i and j are more similar than objects k and l.'' Such a scenario is commonly encountered in crowdsourcing applications. The focus of this work is to develop comparison-based hierarchical clustering algorithms that do not rely on the principles of ordinal embedding. We show that single and complete linkage are inherently comparison-based and we develop variants of average linkage.