Fast Clustering of Categorical Big Data
The K-Modes algorithm, developed for clustering categorical data, is algorithmically simple but suffers from unreliable performance in both clustering quality and efficiency, each heavily influenced by the choice of initial cluster centers. In this paper, we investigate Bisecting K-Modes (BK-Modes), a successive bisecting process for finding clusters, to examine how well the cluster centers produced by the bisecting process serve as initial centers for K-Modes. BK-Modes works by splitting a dataset into multiple clusters iteratively, with one cluster chosen and bisected into two clusters in each iteration. We use the sum of distances of data points to their cluster centers as the selection metric for choosing the cluster to bisect in each iteration. The iterative process stops when K clusters are produced, and the centers of these K clusters are then used as the initial cluster centers for K-Modes. Experimental studies compared BK-Modes against K-Modes run with multiple sets of initial cluster centers, as well as against the best of the existing methods found in our survey. The results indicate good performance of BK-Modes in both clustering quality and efficiency for large datasets.
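The bisecting process described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes tuple-valued categorical records, uses a simple farthest-point seeding for each bisection, and assumes every chosen cluster contains at least two distinct records so it actually splits.

```python
from collections import Counter

def mode_center(cluster):
    # Per-attribute most frequent category (the "mode" of a cluster).
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def mismatch(x, y):
    # Simple matching dissimilarity used by K-Modes: count of differing attributes.
    return sum(a != b for a, b in zip(x, y))

def two_modes(cluster, iters=10):
    # Bisect one cluster with a tiny 2-modes pass, seeded with the first
    # point and the point farthest from it (a simplification, not the
    # paper's exact seeding).
    c1 = cluster[0]
    c2 = max(cluster, key=lambda x: mismatch(x, c1))
    part_a, part_b = cluster, []
    for _ in range(iters):
        part_a = [x for x in cluster if mismatch(x, c1) <= mismatch(x, c2)]
        part_b = [x for x in cluster if mismatch(x, c1) > mismatch(x, c2)]
        if not part_a or not part_b:
            break
        c1, c2 = mode_center(part_a), mode_center(part_b)
    return part_a, part_b

def bk_modes_init(data, k):
    # BK-Modes: repeatedly bisect the cluster with the largest sum of
    # distances to its center until k clusters remain; their modes become
    # the initial cluster centers for K-Modes.
    clusters = [list(data)]
    while len(clusters) < k:
        worst = max(clusters,
                    key=lambda c: sum(mismatch(x, mode_center(c)) for x in c))
        clusters.remove(worst)
        clusters.extend(part for part in two_modes(worst) if part)
    return [mode_center(c) for c in clusters]
```

On a toy dataset such as `[("a","x"), ("a","x"), ("a","y"), ("b","z"), ("b","z"), ("b","w")]` with `k=2`, this yields the modes `("a","x")` and `("b","z")` as initial centers.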
$k$-Median Clustering via Metric Embedding: Towards Better Initialization with Differential Privacy
Chenglin Fan, Ping Li, Xiaoyun Li
When designing clustering algorithms, the choice of initial centers is crucial for the quality of the learned clusters. In this paper, we develop a new initialization scheme, called HST initialization, for the $k$-median problem in general metric spaces (e.g., the discrete space induced by a graph), based on the construction of a metric embedding tree structure of the data. From the tree, we propose a novel and efficient search algorithm for good initial centers that can subsequently be used by the local search algorithm. Our proposed HST initialization produces initial centers with lower error than those from another popular initialization method, $k$-median++, at comparable efficiency. HST initialization can also be extended to the setting of differential privacy (DP) to generate private initial centers. We show that the error from applying DP local search after our private HST initialization improves previous results on the approximation error, and approaches the lower bound within a small factor. Experiments justify the theory and demonstrate the effectiveness of our proposed method. Our approach can also be extended to the $k$-means problem.
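The HST construction itself is too involved for a short sketch, but the local search stage the abstract refers to is the standard single-swap heuristic for $k$-median. A minimal version, with `eps` as a hypothetical improvement threshold (the $(1 - \epsilon/k)$ acceptance rule commonly used in the analysis), might look like:

```python
def kmedian_cost(points, centers, dist):
    # k-median objective: each point pays its distance to the nearest center.
    return sum(min(dist(p, c) for c in centers) for p in points)

def local_search(points, init_centers, dist, eps=0.01):
    # Single-swap local search: repeatedly swap one center for one
    # non-center while the cost drops by at least a (1 - eps/k) factor.
    centers = list(init_centers)
    k = len(centers)
    cost = kmedian_cost(points, centers, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:
                if p in centers:
                    continue
                trial = centers[:i] + [p] + centers[i + 1:]
                new_cost = kmedian_cost(points, trial, dist)
                if new_cost < (1 - eps / k) * cost:
                    centers, cost, improved = trial, new_cost, True
                    break
            if improved:
                break
    return centers, cost
```

For example, on the line metric over `[0, 1, 2, 10, 11, 12]` with the (deliberately poor) initial centers `[0, 1]`, the search converges to centers `{1, 11}` with cost 4. The quality of the final solution and the number of swap rounds both depend on how good the initial centers are, which is the role the HST (or $k$-median++) initialization plays.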
Generalization of k-means Related Algorithms
This article briefly introduces Arthur and Vassilvitskii's work on the \textbf{k-means++} algorithm and generalizes its center initialization process. It is found that choosing the sample point most distant from its nearest center as the new center largely achieves the same effect as the center initialization process in the \textbf{k-means++} algorithm.
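The two seeding rules being compared can be sketched side by side. This is a minimal illustration, not the article's code; the fixed `rng` seed and the tie-breaking of `max` are arbitrary choices here.

```python
import random

def kmeanspp_seed(points, k, dist, rng=random.Random(0)):
    # k-means++ seeding: each new center is sampled with probability
    # proportional to D(x)^2, the squared distance to its nearest center.
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(dist(p, c) for c in centers) ** 2 for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def farthest_first_seed(points, k, dist):
    # The deterministic generalization discussed above: the next center is
    # the sample point most distant from its nearest chosen center.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    return centers
```

On `[0, 1, 2, 10, 11, 12]`, farthest-first seeding with two centers picks `0` and then `12`; the D²-sampling rule of k-means++ would pick the far point with high probability rather than with certainty, which is the similarity the article observes.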
Spaces of Clusterings
Rolle, Alexander, Scoccola, Luis
Often, a clustering algorithm, rather than producing a single clustering of a dataset, produces a set of clusterings. For example, one gets a set of clusterings by running a clustering algorithm with a range of parameters, or with many initializations. Given a set S of clusterings of a dataset X, one may want to know how many different kinds of clusterings the set S contains, ignoring small differences between elements of S. In effect, one may want to cluster S. This paper proposes two clustering algorithms, specifically for use on sets of clusterings of a fixed dataset. The starting point is the observation that sets of clusterings have geometric structure. Indeed, there are many ways, described in the literature, to define a metric on the set of all clusterings of a fixed dataset, and it is a natural idea to use such metrics to cluster a set of clusterings.
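As one concrete example of such a metric on clusterings (the literature offers several; this particular choice is ours, not necessarily the paper's), one minus the Rand index turns a pair of labelings of the same dataset into a distance based on pair-counting:

```python
from itertools import combinations

def rand_distance(u, v):
    # 1 minus the Rand index: the fraction of point pairs on which the two
    # labelings u and v disagree about "same cluster vs. different cluster".
    pairs = list(combinations(range(len(u)), 2))
    agree = sum((u[i] == u[j]) == (v[i] == v[j]) for i, j in pairs)
    return 1 - agree / len(pairs)
```

Two labelings that induce the same partition (e.g. `[0,0,1,1]` and `[1,1,0,0]`) are at distance 0; the resulting pairwise distance matrix over a set S of clusterings can then be fed to any metric clustering method, which is the idea the abstract describes.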
Semi-supervised K-means++
Yoder, Jordan, Priebe, Carey E.
Traditionally, practitioners initialize the {\tt k-means} algorithm with centers chosen uniformly at random. Randomized initialization with uneven weights ({\tt k-means++}) has recently been used to improve the performance over this strategy in cost and run-time. We consider the k-means problem with semi-supervised information, where some of the data are pre-labeled, and we seek to label the rest according to the minimum cost solution. By extending the {\tt k-means++} algorithm and analysis to account for the labels, we derive an improved theoretical bound on expected cost and observe improved performance in simulated and real data examples. This analysis provides theoretical justification for a roughly linear semi-supervised clustering algorithm.
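The paper's exact labeled-data extension is not reproduced here. One plausible sketch, in which each pre-labeled class contributes its centroid as a forced center before ordinary $D^2$ sampling fills the remaining slots (a hypothetical simplification, assuming at least one labeled class and points given as numeric tuples), is:

```python
import random

def semisupervised_seed(points, labels, k, rng=random.Random(0)):
    # Hypothetical sketch, not the paper's algorithm: force one center per
    # pre-labeled class (its centroid), then fill the remaining centers by
    # standard k-means++ D^2 sampling over all points.
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    by_class = {}
    for p, y in zip(points, labels):
        if y is not None:
            by_class.setdefault(y, []).append(p)
    centers = [tuple(sum(coord) / len(group) for coord in zip(*group))
               for group in by_class.values()]
    while len(centers) < k:
        weights = [min(sqdist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

With points `[(0,0), (0,1), (10,10), (10,11)]` and only the first two labeled, the forced center is their centroid `(0, 0.5)`, and the $D^2$ weights then make one of the two far points overwhelmingly likely to be chosen as the second center.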
Clustering Stability: An Overview
A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this paper we give a high-level overview about the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.