Village-Net Clustering: A Rapid approach to Non-linear Unsupervised Clustering of High-Dimensional Data

Ballal, Aditya, Datta, Esha, DePaul, Gregory A., Carlsson, Erik, Chen-Izu, Ye, López, Javier E., Izu, Leighton T.

Jan-16-2025–arXiv.org Machine Learning

Clustering large, high-dimensional datasets with diverse variables is essential for extracting high-level latent information from them. Here we present an unsupervised clustering algorithm that we call "Village-Net". Village-Net is specifically designed to cluster high-dimensional data effectively without a priori knowledge of the number of existing clusters. The algorithm operates in two phases. First, it uses K-Means clustering to divide the dataset into distinct subsets that we refer to as "villages". Next, a weighted network is built in which each node represents a village and which captures the proximity relationships between villages. To obtain the final clustering, we process this network with the Walk-likelihood Community Finder (WLCF), a community detection algorithm developed by one of our team members. A salient feature of Village-Net Clustering is its ability to autonomously determine an optimal number of clusters for further analysis based on inherent characteristics of the data. We present extensive benchmarking on existing real-world datasets with known ground-truth labels to showcase its competitive performance, particularly in terms of the normalized mutual information (NMI) score, compared with other state-of-the-art methods. The algorithm is computationally efficient, with a time complexity of O(N*k*d), where N is the number of instances, k is the number of villages, and d is the dimension of the dataset, which makes it well suited to large-scale datasets.
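The two-phase structure described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. Several pieces are assumptions made purely for illustration: the deterministic farthest-first seeding, the edge rule (a point adds weight between its nearest and second-nearest villages when it lies near their boundary, controlled by a hypothetical `ratio` parameter), and the `min_weight` threshold. WLCF is not implemented here; connected components of the village network are used as a much cruder stand-in for community detection.

```python
import math
from collections import defaultdict


def farthest_first(points, k):
    """Deterministic seeding (an assumption): start at points[0], then keep
    adding the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return [list(c) for c in centroids]


def kmeans(points, k, iters=20):
    """Phase 1: Lloyd's K-Means. Each iteration costs O(N*k*d)."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest centroid (its "village").
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a village is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def village_graph(points, centroids, ratio=3.0):
    """Phase 2a (assumed edge rule): a point adds weight between its nearest
    and second-nearest villages when d2 <= ratio * d1."""
    weights = defaultdict(int)
    for p in points:
        ranked = sorted(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        a, b = ranked[0], ranked[1]
        d1, d2 = math.dist(p, centroids[a]), math.dist(p, centroids[b])
        if d2 <= ratio * d1:
            weights[(min(a, b), max(a, b))] += 1
    return weights


def components(k, weights, min_weight=1):
    """Phase 2b: stand-in for WLCF. Groups villages by connected components
    (union-find) over edges whose weight is at least min_weight."""
    parent = list(range(k))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (a, b), w in weights.items():
        if w >= min_weight:
            parent[find(a)] = find(b)
    roots = sorted({find(i) for i in range(k)})
    return [roots.index(find(i)) for i in range(k)]


def village_net_sketch(points, k):
    """Cluster points into villages, then merge villages into clusters."""
    centroids, labels = kmeans(points, k)
    village_to_cluster = components(k, village_graph(points, centroids))
    return [village_to_cluster[v] for v in labels]


# Toy data: two well-separated groups of ten points each on a line.
points = [(float(x), 0.0) for x in range(10)] + [
    (float(x), 0.0) for x in range(100, 110)
]
clusters = village_net_sketch(points, k=4)
print(len(set(clusters)))  # 2
```

In this toy run the four villages merge into two clusters without the number of clusters being supplied, which mirrors the behavior the abstract describes. Only `k`, the number of villages, is chosen up front.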

artificial intelligence, data mining, machine learning

Country:
- North America > United States > California (0.28)

Genre:
- Research Report (0.84)

Industry:
- Health & Medicine > Therapeutic Area (0.69)

Technology:
- Information Technology
  - Data Science > Data Mining (1.00)
  - Artificial Intelligence > Machine Learning
    - Statistical Learning > Clustering (1.00)