Fast and Provably Good Seedings for k-Means Mario Lucic Department of Computer Science Department of Computer Science ETH Zurich

Apr-10-2023, 10:57:50 GMT–Neural Information Processing Systems

Seeding - the task of finding initial cluster centers - is critical in obtaining highquality clusterings for k-Means. However, k-means++ seeding, the state of the art algorithm, does not scale well to massive datasets as it is inherently sequential and requires k full passes through the data. It was recently shown that Markov chain Monte Carlo sampling can be used to efficiently approximate the seeding step of k-means++. However, this result requires assumptions on the data generating distribution. We propose a simple yet fast seeding algorithm that produces provably good clusterings even without assumptions on the data. Our analysis shows that the algorithm allows for a favourable trade-off between solution quality and computational cost, speeding up k-means++ seeding by up to several orders of magnitude.

algorithm, mc 2, ssumption, (14 more...)

Neural Information Processing Systems

Apr-10-2023, 10:57:50 GMT

Conferences PDF

Add feedback

Country:
- Europe
  - Switzerland > Zürich
    - Zürich (0.41)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)

Genre:
- Research Report > New Finding (0.68)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)