Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search

Spalding-Jamieson, Jack, Robson, Eliot Wong, Zheng, Da Wei

Feb-10-2025–arXiv.org Machine Learning

For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7\sim10^9$ points in high-dimensions ($d\geq100$). All current practical methods for this problem have runtimes at least $\Omega(k^2)$. We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.

information retrieval, machine learning, natural language, (15 more...)

arXiv.org Machine Learning

Feb-10-2025

arXiv.org PDF

Add feedback

Country:
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America
  - United States
    - Illinois (0.04)
    - New York (0.04)
    - Arizona (0.04)
    - Ohio (0.04)
    - District of Columbia > Washington (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.04)
    - North Carolina > Wake County
      - Raleigh (0.04)
    - Texas > Dallas County
      - Dallas (0.04)
    - Massachusetts > Suffolk County
      - Boston (0.04)
    - California > Santa Clara County
      - Mountain View (0.04)
    - Georgia > Fulton County
      - Atlanta (0.04)
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
    - Vancouver Island > Capital Regional District
      - Victoria (0.04)
- Europe
  - Czechia > Prague (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - Netherlands > North Brabant
    - Eindhoven (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Alpes-Maritimes > Nice (0.04)
  - Denmark > Central Jutland
    - Aarhus (0.04)
- Asia
  - Singapore (0.04)
  - Malaysia > Kuala Lumpur
    - Kuala Lumpur (0.04)
  - India > Telangana
    - Hyderabad (0.04)
  - China > Fujian Province
    - Xiamen (0.04)
  - Afghanistan > Parwan Province
    - Charikar (0.04)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Search (1.00)
  - Natural Language > Information Retrieval (1.00)
  - Machine Learning > Statistical Learning
    - Clustering (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found