Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search
Spalding-Jamieson, Jack, Robson, Eliot Wong, Zheng, Da Wei
For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7\sim10^9$ points in high-dimensions ($d\geq100$). All current practical methods for this problem have runtimes at least $\Omega(k^2)$. We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.
Feb-10-2025
- Country:
- South America > Chile
- North America
- United States
- Illinois (0.04)
- New York (0.04)
- Arizona (0.04)
- Ohio (0.04)
- District of Columbia > Washington (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.04)
- North Carolina > Wake County
- Raleigh (0.04)
- Texas > Dallas County
- Dallas (0.04)
- Massachusetts > Suffolk County
- Boston (0.04)
- California > Santa Clara County
- Mountain View (0.04)
- Georgia > Fulton County
- Atlanta (0.04)
- Canada > British Columbia
- United States
- Europe
- Czechia > Prague (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Netherlands > North Brabant
- Eindhoven (0.04)
- Italy > Tuscany
- Florence (0.04)
- France > Provence-Alpes-Côte d'Azur
- Alpes-Maritimes > Nice (0.04)
- Denmark > Central Jutland
- Aarhus (0.04)
- Asia
- Singapore (0.04)
- Malaysia > Kuala Lumpur
- Kuala Lumpur (0.04)
- India > Telangana
- Hyderabad (0.04)
- China > Fujian Province
- Xiamen (0.04)
- Afghanistan > Parwan Province
- Charikar (0.04)
- Genre:
- Research Report (0.82)
- Technology: