Fast Clustering of Categorical Big Data
The K-Modes algorithm, developed for clustering categorical data, is algorithmically simple but suffers from unreliable performance in both clustering quality and efficiency, each heavily influenced by the choice of initial cluster centers. In this paper, we investigate Bisecting K-Modes (BK-Modes), a successive bisecting process for finding clusters, to examine how well the cluster centers produced by the bisecting process serve as initial centers for K-Modes. BK-Modes works by splitting a dataset into multiple clusters iteratively, with one cluster chosen and bisected into two clusters in each iteration. We use the sum of distances of data points to their cluster centers as the selection metric for choosing the cluster to bisect in each iteration. The iterative process stops when K clusters are produced, and the centers of these K clusters are then used as the initial cluster centers for K-Modes. Experimental studies compared BK-Modes against K-Modes run with multiple sets of initial cluster centers, as well as against the best of the existing methods found in our survey. The results indicate good performance of BK-Modes in both clustering quality and efficiency for large datasets.
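The bisecting process described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes tuple-valued categorical records, uses a simple farthest-point seeding for each bisection, and assumes every chosen cluster contains at least two distinct records so it actually splits.

```python
from collections import Counter

def mode_center(cluster):
    # Per-attribute most frequent category (the "mode" of a cluster).
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def mismatch(x, y):
    # Simple matching dissimilarity used by K-Modes: count of differing attributes.
    return sum(a != b for a, b in zip(x, y))

def two_modes(cluster, iters=10):
    # Bisect one cluster with a tiny 2-modes pass, seeded with the first
    # point and the point farthest from it (a simplification, not the
    # paper's exact seeding).
    c1 = cluster[0]
    c2 = max(cluster, key=lambda x: mismatch(x, c1))
    part_a, part_b = cluster, []
    for _ in range(iters):
        part_a = [x for x in cluster if mismatch(x, c1) <= mismatch(x, c2)]
        part_b = [x for x in cluster if mismatch(x, c1) > mismatch(x, c2)]
        if not part_a or not part_b:
            break
        c1, c2 = mode_center(part_a), mode_center(part_b)
    return part_a, part_b

def bk_modes_init(data, k):
    # BK-Modes: repeatedly bisect the cluster with the largest sum of
    # distances to its center until k clusters remain; their modes become
    # the initial cluster centers for K-Modes.
    clusters = [list(data)]
    while len(clusters) < k:
        worst = max(clusters,
                    key=lambda c: sum(mismatch(x, mode_center(c)) for x in c))
        clusters.remove(worst)
        clusters.extend(part for part in two_modes(worst) if part)
    return [mode_center(c) for c in clusters]
```

On a toy dataset such as `[("a","x"), ("a","x"), ("a","y"), ("b","z"), ("b","z"), ("b","w")]` with `k=2`, this yields the modes `("a","x")` and `("b","z")` as initial centers.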
$k$-Median Clustering via Metric Embedding: Towards Better Initialization with Differential Privacy
Chenglin Fan, Ping Li, Xiaoyun Li
When designing clustering algorithms, the choice of initial centers is crucial for the quality of the learned clusters. In this paper, we develop a new initialization scheme, called HST initialization, for the $k$-median problem in general metric spaces (e.g., the discrete space induced by a graph), based on the construction of a metric embedding tree structure of the data. From the tree, we propose a novel and efficient search algorithm for good initial centers that can subsequently be used by the local search algorithm. Our proposed HST initialization produces initial centers with lower error than those from another popular initialization method, $k$-median++, at comparable efficiency. HST initialization can also be extended to the setting of differential privacy (DP) to generate private initial centers. We show that the error from applying DP local search after our private HST initialization improves previous results on the approximation error, and approaches the lower bound within a small factor. Experiments justify the theory and demonstrate the effectiveness of our proposed method. Our approach can also be extended to the $k$-means problem.
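The HST construction itself is too involved for a short sketch, but the local search stage the abstract refers to is the standard single-swap heuristic for $k$-median. A minimal version, with `eps` as a hypothetical improvement threshold (the $(1 - \epsilon/k)$ acceptance rule commonly used in the analysis), might look like:

```python
def kmedian_cost(points, centers, dist):
    # k-median objective: each point pays its distance to the nearest center.
    return sum(min(dist(p, c) for c in centers) for p in points)

def local_search(points, init_centers, dist, eps=0.01):
    # Single-swap local search: repeatedly swap one center for one
    # non-center while the cost drops by at least a (1 - eps/k) factor.
    centers = list(init_centers)
    k = len(centers)
    cost = kmedian_cost(points, centers, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:
                if p in centers:
                    continue
                trial = centers[:i] + [p] + centers[i + 1:]
                new_cost = kmedian_cost(points, trial, dist)
                if new_cost < (1 - eps / k) * cost:
                    centers, cost, improved = trial, new_cost, True
                    break
            if improved:
                break
    return centers, cost
```

For example, on the line metric over `[0, 1, 2, 10, 11, 12]` with the (deliberately poor) initial centers `[0, 1]`, the search converges to centers `{1, 11}` with cost 4. The quality of the final solution and the number of swap rounds both depend on how good the initial centers are, which is the role the HST (or $k$-median++) initialization plays.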
Generalization of k-means Related Algorithms
This article briefly introduces Arthur and Vassilvitskii's work on the \textbf{k-means++} algorithm and generalizes its center initialization process. It is found that choosing the sample point most distant from its nearest center as the new center largely achieves the same effect as the center initialization process in the \textbf{k-means++} algorithm.
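The two seeding rules being compared can be sketched side by side. This is a minimal illustration, not the article's code; the fixed `rng` seed and the tie-breaking of `max` are arbitrary choices here.

```python
import random

def kmeanspp_seed(points, k, dist, rng=random.Random(0)):
    # k-means++ seeding: each new center is sampled with probability
    # proportional to D(x)^2, the squared distance to its nearest center.
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(dist(p, c) for c in centers) ** 2 for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def farthest_first_seed(points, k, dist):
    # The deterministic generalization discussed above: the next center is
    # the sample point most distant from its nearest chosen center.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    return centers
```

On `[0, 1, 2, 10, 11, 12]`, farthest-first seeding with two centers picks `0` and then `12`; the D²-sampling rule of k-means++ would pick the far point with high probability rather than with certainty, which is the similarity the article observes.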
Spaces of Clusterings
Rolle, Alexander, Scoccola, Luis
Often, a clustering algorithm, rather than producing a single clustering of a dataset, produces a set of clusterings. For example, one gets a set of clusterings by running a clustering algorithm with a range of parameters, or with many initializations. Given a set S of clusterings of a dataset X, one may want to know how many different kinds of clusterings the set S contains, ignoring small differences between elements of S. In effect, one may want to cluster S. This paper proposes two clustering algorithms, specifically for use on sets of clusterings of a fixed dataset. The starting point is the observation that sets of clusterings have geometric structure. Indeed, there are many ways, described in the literature, to define a metric on the set of all clusterings of a fixed dataset, and it is a natural idea to use such metrics to cluster a set of clusterings.
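As one concrete example of such a metric on clusterings (the literature offers several; this particular choice is ours, not necessarily the paper's), one minus the Rand index turns a pair of labelings of the same dataset into a distance based on pair-counting:

```python
from itertools import combinations

def rand_distance(u, v):
    # 1 minus the Rand index: the fraction of point pairs on which the two
    # labelings u and v disagree about "same cluster vs. different cluster".
    pairs = list(combinations(range(len(u)), 2))
    agree = sum((u[i] == u[j]) == (v[i] == v[j]) for i, j in pairs)
    return 1 - agree / len(pairs)
```

Two labelings that induce the same partition (e.g. `[0,0,1,1]` and `[1,1,0,0]`) are at distance 0; the resulting pairwise distance matrix over a set S of clusterings can then be fed to any metric clustering method, which is the idea the abstract describes.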
Semi-supervised K-means++
Yoder, Jordan, Priebe, Carey E.
Traditionally, practitioners initialize the {\tt k-means} algorithm with centers chosen uniformly at random. Randomized initialization with uneven weights ({\tt k-means++}) has recently been used to improve the performance over this strategy in cost and run-time. We consider the k-means problem with semi-supervised information, where some of the data are pre-labeled, and we seek to label the rest according to the minimum cost solution. By extending the {\tt k-means++} algorithm and analysis to account for the labels, we derive an improved theoretical bound on expected cost and observe improved performance in simulated and real data examples. This analysis provides theoretical justification for a roughly linear semi-supervised clustering algorithm.
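The paper's exact labeled-data extension is not reproduced here. One plausible sketch, in which each pre-labeled class contributes its centroid as a forced center before ordinary $D^2$ sampling fills the remaining slots (a hypothetical simplification, assuming at least one labeled class and points given as numeric tuples), is:

```python
import random

def semisupervised_seed(points, labels, k, rng=random.Random(0)):
    # Hypothetical sketch, not the paper's algorithm: force one center per
    # pre-labeled class (its centroid), then fill the remaining centers by
    # standard k-means++ D^2 sampling over all points.
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    by_class = {}
    for p, y in zip(points, labels):
        if y is not None:
            by_class.setdefault(y, []).append(p)
    centers = [tuple(sum(coord) / len(group) for coord in zip(*group))
               for group in by_class.values()]
    while len(centers) < k:
        weights = [min(sqdist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

With points `[(0,0), (0,1), (10,10), (10,11)]` and only the first two labeled, the forced center is their centroid `(0, 0.5)`, and the $D^2$ weights then make one of the two far points overwhelmingly likely to be chosen as the second center.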
Clustering Stability: An Overview
A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview about the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.