k-medoid problem
Global Optimal K-Medoids Clustering of One Million Samples
We study the deterministic global optimization of the K-Medoids clustering problem. We propose a branch and bound (BB) scheme in which a tailored Lagrangian relaxation method, first proposed in the 1970s, provides a lower bound at each BB node. The lower bounding method already guarantees the maximum gap at the root node. The lower bound has a closed-form solution that can be derived analytically without explicitly solving any optimization problem, and its computation is easily parallelized. Moreover, with this lower bounding method, finite convergence to the global optimal solution is guaranteed by branching only on the regions of medoids. We also present several tailored bound tightening techniques that reduce the search space and computational cost. Extensive computational studies on 28 machine learning datasets demonstrate that our algorithm provides a provably globally optimal solution with an optimality gap of 0.1% within 4 hours on datasets with up to one million samples. In addition, our algorithm obtains objective values better than or equal to those of the heuristic method. A theoretical proof of global convergence for our algorithm is also presented.
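The abstract does not spell out the relaxation, so the following is only a minimal sketch under a stated assumption: it uses the classic Lagrangian relaxation of the p-median formulation, in which the constraint that each sample is assigned to exactly one medoid is dualized with one multiplier per sample. Under that assumption the relaxed problem has a closed-form solution (pick the K candidates with the smallest reduced costs), and every reduced cost can be computed independently, which is what makes parallelization easy. The function names and the multiplier choice are hypothetical and are not taken from the paper.

```python
import itertools
import math
import random


def lagrangian_lower_bound(dist, k, multipliers):
    """Lower bound on the K-medoids cost, valid for any multipliers.

    Assumed relaxation (not confirmed by the abstract): the constraint
    "each sample is assigned to exactly one medoid" is dualized with one
    multiplier per sample.
    """
    n = len(dist)
    # rho[j]: reduced cost of selecting candidate j as a medoid.
    # Each entry is independent of the others, so this loop could run in parallel.
    rho = [
        sum(min(0.0, dist[i][j] - multipliers[i]) for i in range(n))
        for j in range(n)
    ]
    # Closed form: the relaxed problem is solved by the k smallest reduced costs.
    return sum(multipliers) + sum(sorted(rho)[:k])


def brute_force_optimum(dist, k):
    """Exact K-medoids cost by enumeration (tiny illustrative inputs only)."""
    n = len(dist)
    return min(
        sum(min(dist[i][j] for j in medoids) for i in range(n))
        for medoids in itertools.combinations(range(n), k)
    )


random.seed(0)
points = [(random.random(), random.random()) for _ in range(9)]
dist = [[math.dist(p, q) for q in points] for p in points]
k = 3

# Hypothetical multiplier choice: distance from each sample to its nearest other sample.
multipliers = [
    min(d for j, d in enumerate(row) if j != i) for i, row in enumerate(dist)
]

bound = lagrangian_lower_bound(dist, k, multipliers)
optimum = brute_force_optimum(dist, k)
print(f"lower bound = {bound:.4f}, optimum = {optimum:.4f}")
```

In a branch and bound scheme, a bound of this kind would be evaluated at each node and compared against the best known feasible cost; how the paper tightens the multipliers and restricts the medoid regions is not shown here.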
BanditPAM++: Faster k-medoids Clustering
Clustering is a fundamental task in data science with wide-ranging applications. In k-medoids clustering, cluster centers must be actual datapoints and arbitrary distance metrics may be used; these features allow for greater interpretability of the cluster centers and the clustering of exotic objects, respectively.
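To make the two properties above concrete, here is a small sketch: the cluster center is chosen from the data itself, and the distance is an arbitrary user-supplied function (Hamming distance on strings in this case). The helper names `hamming`, `kmedoids_loss`, and `medoid` are illustrative and do not come from any library.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def kmedoids_loss(items, medoid_indices, dist):
    """Sum over all items of the distance to the nearest chosen medoid."""
    return sum(min(dist(x, items[m]) for m in medoid_indices) for x in items)


def medoid(items, dist):
    """Index of the item with the smallest total distance to all items."""
    return min(
        range(len(items)),
        key=lambda j: sum(dist(x, items[j]) for x in items),
    )


words = ["karolin", "kathrin", "kerstin", "karolyn"]
center = medoid(words, hamming)
print(words[center], kmedoids_loss(words, [center], hamming))
```

Because the center is one of the input strings, it can be inspected directly, and no averaging of strings is ever required.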
EKM: An exact, polynomial-time algorithm for the $K$-medoids problem
The $K$-medoids problem is a challenging combinatorial clustering task, widely used in data analysis applications. While numerous algorithms have been proposed to solve this problem, none of them can obtain an exact (globally optimal) solution in polynomial time. In this paper, we present EKM: a novel algorithm that solves this problem exactly with worst-case $O\left(N^{K+1}\right)$ time complexity. EKM is developed using recent advances in transformational programming and combinatorial generation, through formal program derivation steps. The derived algorithm is provably correct by construction. We demonstrate the effectiveness of our algorithm by comparing it against various approximate methods on numerous real-world datasets. We show that on synthetic datasets the wall-clock run time of our algorithm matches the worst-case time complexity analysis, clearly outperforming the exponential time complexity of benchmark branch-and-bound-based MIP solvers. To our knowledge, this is the first rigorously proven, polynomial-time, practical algorithm for this ubiquitous problem.
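For intuition about the $O\left(N^{K+1}\right)$ figure, the sketch below performs a naive exhaustive search: there are $\binom{N}{K} \le N^K$ candidate medoid sets, and scoring each one against all $N$ samples costs $O(NK)$, which for fixed $K$ gives the same order of growth. This is not the derived EKM algorithm, whose construction is not described in the abstract; it is only an assumed baseline with a hypothetical function name.

```python
import itertools


def exhaustive_kmedoids(dist, k):
    """Return (medoid indices, cost) minimizing the K-medoids objective.

    Naive enumeration over all k-subsets of candidate medoids; an
    illustrative baseline, not the EKM algorithm itself.
    """
    n = len(dist)
    best_cost, best_medoids = float("inf"), None
    for medoids in itertools.combinations(range(n), k):
        # Each sample pays the distance to its nearest medoid in this subset.
        cost = sum(min(dist[i][j] for j in medoids) for i in range(n))
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids, best_cost


# Toy 1-D data with two well-separated groups.
xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in xs] for a in xs]
medoids, cost = exhaustive_kmedoids(dist, 2)
print(medoids, cost)
```

On this toy input the search selects the middle point of each group, as expected.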
BanditPAM++: Faster $k$-medoids Clustering
Tiwari, Mo, Kang, Ryan, Lee, Donghyun, Thrun, Sebastian, Piech, Chris, Shomorony, Ilan, Zhang, Martin Jinye
Clustering is a fundamental task in data science with wide-ranging applications. In $k$-medoids clustering, cluster centers must be actual datapoints and arbitrary distance metrics may be used; these features allow for greater interpretability of the cluster centers and the clustering of exotic objects, respectively. $k$-medoids clustering has recently grown in popularity due to the discovery of more efficient $k$-medoids algorithms. In particular, recent research has proposed BanditPAM, a randomized $k$-medoids algorithm with state-of-the-art complexity and clustering accuracy. In this paper, we present BanditPAM++, which accelerates BanditPAM via two algorithmic improvements; it is a factor of $O(k)$ faster than BanditPAM in complexity and substantially faster in wall-clock runtime. First, we demonstrate that BanditPAM has a special structure that allows the reuse of clustering information $\textit{within}$ each iteration. Second, we demonstrate that BanditPAM has additional structure that permits the reuse of information $\textit{across}$ different iterations. These observations inspire our proposed algorithm, BanditPAM++, which returns the same clustering solutions as BanditPAM but often several times faster. For example, on the CIFAR10 dataset, BanditPAM++ returns the same results as BanditPAM but runs over 10$\times$ faster. Finally, we provide a high-performance C++ implementation of BanditPAM++, callable from Python and R, at https://github.com/motiwari/BanditPAM, which may be of interest to practitioners. Auxiliary code to reproduce all of our experiments via a one-line script is available at https://github.com/ThrunGroup/BanditPAM_plusplus_experiments.
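The idea of reusing information across iterations can be illustrated, in a much simplified form, by caching pairwise distances during a plain PAM-style swap search. The sketch below is an assumption-laden toy and not BanditPAM++ itself: it uses exhaustive swap evaluation rather than bandit sampling, and the names `dist`, `loss`, and `swap_search` are hypothetical. It shows only that repeated swap iterations never need more than one evaluation per pair of points once distances are cached.

```python
import math

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
cache = {}  # maps an unordered index pair to its distance


def dist(i, j):
    """Distance between points i and j, computed at most once per pair."""
    if i == j:
        return 0.0
    key = (min(i, j), max(i, j))
    if key not in cache:
        cache[key] = math.dist(points[i], points[j])
    return cache[key]


def loss(medoids):
    """K-medoids objective: each point pays the distance to its nearest medoid."""
    return sum(min(dist(i, m) for m in medoids) for i in range(len(points)))


def swap_search(medoids):
    """Accept any medoid/non-medoid swap that strictly lowers the loss."""
    medoids = list(medoids)
    improved = True
    while improved:
        improved = False
        current = loss(medoids)
        for pos in range(len(medoids)):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_loss = loss(trial)
                if trial_loss < current:
                    medoids, current, improved = trial, trial_loss, True
    return medoids


start = [1, 2]
final = swap_search(start)
print(sorted(final), loss(final), len(cache))
```

With six points there are only 15 distinct pairs, so the cache size is bounded by 15 no matter how many swap iterations run.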