CoHiRF: A Scalable and Interpretable Clustering Framework for High-Dimensional Data

Belucci, Bruno, Lounici, Karim, Meziani, Katia

arXiv.org Machine Learning 

Clustering high-dimensional data poses significant challenges due to the curse of dimensionality, scalability issues, and the presence of noisy and irrelevant features. We propose Consensus Hierarchical Random Feature (CoHiRF), a novel clustering method designed to address these challenges effectively. CoHiRF leverages random feature selection to mitigate noise and dimensionality effects, repeatedly applies K-Means clustering in reduced feature spaces, and combines results through a unanimous consensus criterion. This iterative approach constructs a cluster assignment matrix, where each row records the cluster assignments of a sample across repetitions, enabling the identification of stable clusters by comparing identical rows.

High-dimensional datasets suffer from the well-known "curse of dimensionality." As the dimensionality p increases, the relevant information often lies in a low-dimensional subspace, with the remaining dimensions contributing predominantly to noise. Consequently, data points tend to become equidistant in high-dimensional space, rendering traditional distance-based clustering algorithms, such as K-Means, less effective (Beyer et al., 1999). Specifically, the Euclidean distance metric loses its discriminative power, resulting in poor clustering performance. Another critical challenge is scalability: traditional clustering methods, originally designed for low-dimensional or small datasets, often struggle with high computational and memory demands when applied to high-dimensional data settings (Steinbach et al., 2004; Assent, 2012; Zimek et al., 2012; Mahdi et al., 2021).
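The consensus mechanism described in the abstract can be sketched in code. The following is an illustrative, simplified sketch of one consensus round, not the authors' implementation: the function name `cohirf_step` and the parameter choices are assumptions for illustration, and the hierarchical repetition of this step is omitted. It runs K-Means on several random feature subsets, stacks the resulting label vectors into a cluster assignment matrix with one row per sample, and merges samples whose rows are identical (the unanimous consensus criterion).

```python
import numpy as np
from sklearn.cluster import KMeans

def cohirf_step(X, n_clusters=3, n_repetitions=5, n_features=2, rng=None):
    """One consensus round in the spirit of CoHiRF (illustrative sketch only).

    For each repetition, K-Means is applied to a random subset of features.
    Samples are then merged when their rows in the cluster assignment
    matrix are identical across all repetitions.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Cluster assignment matrix: row i holds sample i's labels across repetitions.
    A = np.empty((n, n_repetitions), dtype=int)
    for r in range(n_repetitions):
        # Random feature selection mitigates noisy/irrelevant dimensions.
        feats = rng.choice(p, size=min(n_features, p), replace=False)
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=int(rng.integers(1 << 31)))
        A[:, r] = km.fit_predict(X[:, feats])
    # Unanimous consensus: samples with identical rows form one stable cluster.
    _, consensus = np.unique(A, axis=0, return_inverse=True)
    return consensus.reshape(-1)
```

Note that comparing whole rows makes the consensus invariant to label permutations between repetitions: two samples end up together only if every repetition placed them in the same cluster.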