An Efficient $k$-modes Algorithm for Clustering Categorical Datasets
Dorman, Karin S., Maitra, Ranjan
Mining clusters from datasets is an important endeavor in many applications. The k-means algorithm is a popular, efficient, distribution-free approach for clustering numerical-valued data, but it does not apply to categorical-valued observations. We provide a novel, computationally efficient implementation of k-modes, called OTQT. We prove that OTQT finds updates, undetectable to existing k-modes algorithms, that improve the objective function. Thus, although slightly slower per iteration owing to its algorithmic complexity, OTQT is always more accurate per iteration and almost always faster to reach the final optimum (and only barely slower on some datasets). As a result, we recommend OTQT as the preferred, default algorithm for all k-modes implementations. We also examine five initialization methods and three types of K-selection methods, many of them novel or novel applications to k-modes. By examining performance on real and simulated datasets, we show that simple random initialization is the best initializer and that a novel K-selection method is more accurate than methods adapted from k-means.

Identifying groups of similar observations in datasets is common in a wide array of applications, with many clustering methods developed in statistics, machine learning and the applied sciences [1]-[7]. The k-means algorithm [8]-[11] is arguably the most popular method for clustering numerical-valued observations. It scales to large datasets because it does not require calculation of all pairwise distances, and it is distribution-free. Although distribution-free does not mean assumption-free [12], [13], k-means is a reasonable starting place for users wary of making assumptions about their data. Unfortunately, k-means does not provide an appropriate objective to minimize for datasets with categorical attributes.
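To make the contrast with k-means concrete, the following minimal Python sketch evaluates the objective that k-modes is commonly understood to minimize: the total number of attribute mismatches (Hamming distance) between each observation and the mode of its assigned cluster. It is an illustration under stated assumptions, not the OTQT algorithm and not the authors' code. The function names, the toy records, and the fixed labels are all hypothetical, and every cluster is assumed to be non-empty.

```python
from collections import Counter

def hamming(x, y):
    """Number of attributes on which two categorical records differ."""
    return sum(a != b for a, b in zip(x, y))

def cluster_modes(data, labels, k):
    """Mode of each cluster: the most frequent category per attribute.

    Assumes every cluster 0..k-1 has at least one member.
    """
    modes = []
    for c in range(k):
        members = [x for x, lab in zip(data, labels) if lab == c]
        # zip(*members) iterates over attributes (columns) of the cluster.
        modes.append(tuple(Counter(col).most_common(1)[0][0]
                           for col in zip(*members)))
    return modes

def kmodes_cost(data, labels, modes):
    """Total mismatch count between records and their cluster modes."""
    return sum(hamming(x, modes[lab]) for x, lab in zip(data, labels))

# Hypothetical toy data: (color, size, shape) records with a fixed labeling.
data = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("blue", "small", "square"),
]
labels = [0, 0, 0, 1, 1, 1]

modes = cluster_modes(data, labels, k=2)
cost = kmodes_cost(data, labels, modes)
print(modes)
print(cost)
```

A cluster mean is undefined for values such as "red" or "square", which is why the sketch summarizes each cluster by per-attribute modes and counts mismatches rather than squared Euclidean distances.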
Jun-6-2020