Big-Data Clustering: K-Means or K-Indicators?
Chen, Feiyu; Yang, Yuchen; Xu, Liwei; Zhang, Taiping; Zhang, Yin
The K-means algorithm is arguably the most popular data clustering method, commonly applied to processed datasets in some "feature spaces", as in spectral clustering. Because it is highly sensitive to initialization, however, K-means encounters a scalability bottleneck with respect to the number of clusters K as this number grows in big-data applications. In this work, we promote a closely related model called the K-indicators model and construct an efficient, semi-convex-relaxation algorithm that requires no randomized initialization. We present extensive empirical results to show the advantages of the new algorithm when K is large. In particular, using the new algorithm to start the K-means algorithm, without any replication, can significantly outperform standard K-means with a large number of current state-of-the-art random replications.
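To make the baseline in the abstract concrete, the following is a minimal sketch of standard K-means (Lloyd iterations) run with several random replications, keeping the run with the lowest objective. It is not the K-indicators algorithm, which the abstract does not specify. The function names (`lloyd_kmeans`, `kmeans_with_replications`, `make_blobs`), the synthetic data, and the plain uniform random initialization are all assumptions made for illustration.

```python
import numpy as np


def lloyd_kmeans(X, centers, n_iter=100):
    """Run Lloyd iterations from the given initial centers (illustrative helper)."""
    centers = centers.copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, K).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(centers.shape[0]):
            members = X[labels == k]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and K-means objective (sum of squared distances).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    objective = d2.min(axis=1).sum()
    return labels, centers, objective


def kmeans_with_replications(X, K, n_rep, seed=0):
    """Best of n_rep runs, each started from K data points chosen at random."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_rep):
        init = X[rng.choice(len(X), size=K, replace=False)]
        result = lloyd_kmeans(X, init)
        if best is None or result[2] < best[2]:
            best = result
    return best


def make_blobs(K, per_cluster=30, seed=1):
    """Hypothetical test data: K well-separated 2-D Gaussian blobs on a grid."""
    rng = np.random.default_rng(seed)
    idx = np.arange(K)
    centers = 10.0 * np.stack([idx % 4, idx // 4], axis=1)
    return np.repeat(centers, per_cluster, axis=0) + rng.normal(size=(K * per_cluster, 2))


if __name__ == "__main__":
    X = make_blobs(K=8)
    for n_rep in (1, 10):
        _, _, obj = kmeans_with_replications(X, K=8, n_rep=n_rep, seed=0)
        print(f"replications={n_rep:2d}  objective={obj:.2f}")
```

Each replication costs a full K-means run, so the total work grows linearly with the number of replications. This is the cost that a single deterministic starting point, as the abstract proposes, would avoid.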
Jun-3-2019