Big-Data Clustering: K-Means or K-Indicators?

Feiyu Chen, Yuchen Yang, Liwei Xu, Taiping Zhang, Yin Zhang

Jun-3-2019, arXiv.org Machine Learning

The K-means algorithm is arguably the most popular data clustering method. It is commonly applied to processed datasets in some "feature space", as in spectral clustering. However, K-means is highly sensitive to initialization, and it encounters a scalability bottleneck with respect to the number of clusters K as this number grows in big-data applications. In this work, we promote a closely related model called the K-indicators model and construct an efficient, semi-convex-relaxation algorithm that requires no randomized initialization. We present extensive empirical results to show the advantages of the new algorithm when K is large. In particular, using the new algorithm to start the K-means algorithm, without any replication, can significantly outperform standard K-means run with a large number of random replications using current state-of-the-art initialization.
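To make the comparison in the abstract concrete, the following is a minimal sketch of the two ways of running K-means that it contrasts: many randomly initialized replications, keeping the best objective value, versus a single run from supplied starting centers. It does not implement the K-indicators algorithm, which the abstract does not specify. The function names (`lloyd`, `kmeans_replicated`, `farthest_point_init`), the toy data, and the farthest-point initializer are all assumptions for illustration; the initializer is only a deterministic placeholder for "some method that supplies starting centers".

```python
import numpy as np

def lloyd(X, centers, n_iter=100):
    """Plain Lloyd iterations from the given centers; returns (centers, labels, sse)."""
    centers = centers.copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, K).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members):  # an empty cluster keeps its old center
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = d2[np.arange(len(X)), labels].sum()  # K-means objective value
    return centers, labels, sse

def kmeans_replicated(X, K, n_rep, seed=0):
    """Run Lloyd from n_rep random initializations and keep the lowest objective."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_rep):
        init = X[rng.choice(len(X), size=K, replace=False)]
        result = lloyd(X, init)
        if best is None or result[2] < best[2]:
            best = result
    return best

def farthest_point_init(X, K):
    """Deterministic placeholder initializer (NOT the paper's K-indicators method)."""
    chosen = [0]
    d2 = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(K - 1):
        nxt = int(d2.argmax())  # point farthest from all chosen so far
        chosen.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[chosen]

# Toy data (assumed): 16 tight, well-separated clusters on a 4 x 4 grid.
data_rng = np.random.default_rng(42)
grid = [(10.0 * i, 10.0 * j) for i in range(4) for j in range(4)]
X = np.vstack([np.array(c) + data_rng.uniform(-0.5, 0.5, size=(20, 2)) for c in grid])
K = len(grid)

centers_r, labels_r, sse_r = kmeans_replicated(X, K, n_rep=5)
centers_w, labels_w, sse_w = lloyd(X, farthest_point_init(X, K))
print("best of 5 random replications:", sse_r)
print("single run from supplied centers:", sse_w)
```

On this toy data the single seeded run recovers every cluster, while the random replications may or may not, depending on the seed. This illustrates only why the quality of the starting point matters more as K grows; it says nothing about how the paper's algorithm performs.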

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
