AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Online Clustering with Bandit Information

Chandran, G Dhinesh, Kota, Srinivas Reddy, Bhashyam, Srikrishna

arXiv.org Artificial IntelligenceFeb-2-2025

We study the problem of online clustering within the multi-armed bandit framework under the fixed confidence setting. In this multi-armed bandit problem, we have $M$ arms, each providing i.i.d. samples that follow a multivariate Gaussian distribution with an {\em unknown} mean and a known unit covariance. The arms are grouped into $K$ clusters based on the distance between their means using the Single Linkage (SLINK) clustering algorithm on the means of the arms. Since the true means are unknown, the objective is to obtain the above clustering of the arms with the minimum number of samples drawn from the arms, subject to an upper bound on the error probability. We introduce a novel algorithm, Average Tracking Bandit Online Clustering (ATBOC), and prove that this algorithm is order optimal, meaning that the upper bound on its expected sample complexity for given error probability $\delta$ is within a factor of 2 of an instance-dependent lower bound as $\delta \rightarrow 0$. Furthermore, we propose a computationally more efficient algorithm, Lower and Upper Confidence Bound-based Bandit Online Clustering (LUCBBOC), inspired by the LUCB algorithm for best arm identification. Simulation results demonstrate that the performance of LUCBBOC is comparable to that of ATBOC. We numerically assess the effectiveness of the proposed algorithms through numerical experiments on both synthetic datasets and the real-world MovieLens dataset. To the best of our knowledge, this is the first work on bandit online clustering that allows arms with different means in a cluster and $K$ greater than 2.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2501.11421

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Add feedback

CoHiRF: A Scalable and Interpretable Clustering Framework for High-Dimensional Data

Belucci, Bruno, Lounici, Karim, Meziani, Katia

arXiv.org Machine LearningFeb-1-2025

Clustering high-dimensional data poses significant High-dimensional datasets suffer from the well-known challenges due to the curse of dimensionality, "curse of dimensionality." As the dimensionality p increases, scalability issues, and the presence of noisy and the relevant information often lies in a low-dimensional subspace, irrelevant features. We propose Consensus Hierarchical with the remaining dimensions contributing predominantly Random Feature (CoHiRF), a novel to noise. Consequently, data points tend to become clustering method designed to address these challenges equidistant in high-dimensional space, rendering traditional effectively. CoHiRF leverages random feature distance-based clustering algorithms, such as K-Means, less selection to mitigate noise and dimensionality effective (Beyer et al., 1999). Specifically, the Euclidean distance effects, repeatedly applies K-Means clustering metric loses its discriminative power, resulting in poor in reduced feature spaces, and combines results clustering performance. Another critical challenge is scalability: through a unanimous consensus criterion. This traditional clustering methods, originally designed iterative approach constructs a cluster assignment for low-dimensional or small datasets, often struggle with matrix, where each row records the cluster assignments high computational and memory demands when applied of a sample across repetitions, enabling the to high-dimensional data settings (Steinbach et al., 2004; identification of stable clusters by comparing identical Assent, 2012; Zimek et al., 2012; Mahdi et al., 2021).

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

2502.0038

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)

Genre:

Research Report (1.00)
Workflow (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Comparative Analysis of Community Detection Algorithms on the SNAP Social Circles Dataset

Malode, Yash, Aylani, Amit, Bhardwaj, Arvind, Hajoary, Deepak

arXiv.org Artificial IntelligenceFeb-1-2025

In network research, Community Detection has always been a topic of significant interest in network science, with numerous papers and algorithms proposing to uncover the underlying structures within networks. In this paper, we conduct a comparative analysis of several prominent community detection algorithms applied to the SNAP Social Circles Dataset, derived from the Facebook Social Media network. The algorithms implemented include Louvain, Girvan-Newman, Spectral Clustering, K-Means Clustering, etc. We evaluate the performance of these algorithms based on various metrics such as modularity, normalized cut-ratio, silhouette score, compactness, and separability. Our findings reveal insights into the effectiveness of each algorithm in detecting various meaningful communities within the social network, shedding light on their strength and limitations. This research contributes to the understanding of community detection methods and provides valuable guidance for their application in analyzing real-world social networks.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2502.04341

Country:

Asia > India > Maharashtra > Mumbai (0.04)
Asia > India > Assam (0.04)

Genre: Research Report > New Finding (0.35)

Industry: Information Technology > Services (0.72)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Explorations of the Softmax Space: Knowing When the Neural Network Doesn't Know...

Sikar, Daniel, Garcez, Artur d'Avila, Weyde, Tillman

arXiv.org Artificial IntelligenceFeb-1-2025

Ensuring the reliability and safety of automated decision-making is crucial. This paper proposes a new approach for measuring the reliability of predictions in machine learning models. We analyze how the outputs of a trained neural network change using clustering to measure distances between outputs and class centroids. We propose this distance as a metric to evaluate the confidence of predictions. We assign each prediction to a cluster with centroid representing the mean softmax output for all correct predictions of a given class. We then define a safety threshold for a class as the smallest distance from an incorrect prediction to the given class centroid. We evaluate the approach on the MNIST and CIFAR-10 datasets using a Convolutional Neural Network and a Vision Transformer, respectively. The results show that our approach is consistent across these data sets and network models, and indicate that the proposed metric can offer an efficient way of determining when automated predictions are acceptable and when they should be deferred to human operators.

artificial intelligence, machine learning, prediction, (18 more...)

arXiv.org Artificial Intelligence

2502.00456

Genre: Research Report > New Finding (0.88)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning

Guo, Jiaxin, Chen, C. L. Philip, Li, Shuzhen, Zhang, Tong

arXiv.org Artificial IntelligenceJan-31-2025

Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation. It provides high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL. Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both textual diversity and class diversity, ensuring a balanced data distribution. It further propagates uncertainty information via density-based clustering to select hard representative instances. DEUCE performs well in selecting class-balanced and hard representative data by dual-diversity and informativeness. Experiments on six NLP datasets demonstrate the superiority and efficiency of DEUCE.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1162/tacl_a_00731

2502.00305

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > China > Guangdong Province > Guangzhou (0.04)
(17 more...)

Genre: Research Report (1.00)

Industry:

Education (1.00)
Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.86)

Add feedback

Algorithmic Clustering based on String Compression to Extract P300 Structure in EEG Signals

Sarasa, Guillermo, Granados, Ana, Rodríguez, Francisco B

arXiv.org Artificial IntelligenceJan-31-2025

P300 is an Event-Related Potential widely used in Brain-Computer Interfaces, but its detection is challenging due to inter-subject and temporal variability. This work introduces a clustering methodology based on Normalized Compression Distance (NCD) to extract the P300 structure, ensuring robustness against variability. We propose a novel signal-to-ASCII transformation to generate compression-friendly objects, which are then clustered using a hierarchical tree-based method and a multidimensional projection approach. Experimental results on two datasets demonstrate the method's ability to reveal relevant P300 structures, showing clustering performance comparable to state-of-the-art approaches. Furthermore, analysis at the electrode level suggests that the method could assist in electrode selection for P300 detection. This compression-driven clustering methodology offers a complementary tool for EEG analysis and P300 identification.

artificial intelligence, information management, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.cmpb.2019.03.009

2502.0022

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Switzerland > Basel-City > Basel (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology:

Information Technology > Information Management (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Cognitive Science (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Clustering in hyperbolic balls

Jaćimović, Vladimir, Crnkić, Aladin

arXiv.org Artificial IntelligenceJan-31-2025

The idea of representations of the data in negatively curved manifolds recently attracted a lot of attention and gave a rise to the new research direction named {\it hyperbolic machine learning} (ML). In order to unveil the full potential of this new paradigm, efficient techniques for data analysis and statistical modeling in hyperbolic spaces are necessary. In the present paper rigorous mathematical framework for clustering in hyperbolic spaces is established. First, we introduce the $k$-means clustering in hyperbolic balls, based on the novel definition of barycenter. Second, we present the expectation-maximization (EM) algorithm for learning mixtures of novel probability distributions in hyperbolic balls. In such a way we lay the foundation of unsupervised learning in hyperbolic spaces.

algorithm, artificial intelligence, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2501.19247

Country:

Europe > Montenegro > Podgorica > Podgorica (0.04)
Europe > Bosnia and Herzegovina (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Dual-Bounded Nonlinear Optimal Transport for Size Constrained Min Cut Clustering

Xie, Fangyuan, Yuan, Jinghui, Nie, Feiping, Li, Xuelong

arXiv.org Artificial IntelligenceJan-30-2025

Min cut is an important graph partitioning method. However, current solutions to the min cut problem suffer from slow speeds, difficulty in solving, and often converge to simple solutions. To address these issues, we relax the min cut problem into a dual-bounded constraint and, for the first time, treat the min cut problem as a dual-bounded nonlinear optimal transport problem. Additionally, we develop a method for solving dual-bounded nonlinear optimal transport based on the Frank-Wolfe method (abbreviated as DNF). Notably, DNF not only solves the size constrained min cut problem but is also applicable to all dual-bounded nonlinear optimal transport problems. We prove that for convex problems satisfying Lipschitz smoothness, the DNF method can achieve a convergence rate of $\mathcal{O}(\frac{1}{t})$. We apply the DNF method to the min cut problem and find that it achieves state-of-the-art performance in terms of both the loss function and clustering accuracy at the fastest speed, with a convergence rate of $\mathcal{O}(\frac{1}{\sqrt{t}})$. Moreover, the DNF method for the size constrained min cut problem requires no parameters and exhibits better stability.

data mining, machine learning, submission and formatting instruction, (16 more...)

arXiv.org Artificial Intelligence

2501.18143

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Spain (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Add feedback

ACTGNN: Assessment of Clustering Tendency with Synthetically-Trained Graph Neural Networks

Luo, Yiran, Papalexakis, Evangelos E.

arXiv.org Artificial IntelligenceJan-29-2025

Determining clustering tendency in datasets is a fundamental but challenging task, especially in noisy or high-dimensional settings where traditional methods, such as the Hopkins Statistic and Visual Assessment of Tendency (VAT), often struggle to produce reliable results. In this paper, we propose ACTGNN, a graph-based framework designed to assess clustering tendency by leveraging graph representations of data. Node features are constructed using Locality-Sensitive Hashing (LSH), which captures local neighborhood information, while edge features incorporate multiple similarity metrics, such as the Radial Basis Function (RBF) kernel, to model pairwise relationships. A Graph Neural Network (GNN) is trained exclusively on synthetic datasets, enabling robust learning of clustering structures under controlled conditions. Extensive experiments demonstrate that ACTGNN significantly outperforms baseline methods on both synthetic and real-world datasets, exhibiting superior performance in detecting faint clustering structures, even in high-dimensional or noisy data. Our results highlight the generalizability and effectiveness of the proposed approach, making it a promising tool for robust clustering tendency assessment.

artificial intelligence, dataset, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2501.18112

Country:

North America > United States > California > Riverside County > Riverside (0.14)
North America > United States > California > Alameda County > Oakland (0.04)
Asia (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

KNN and K-means in Gini Prametric Spaces

Mussard, Cassandra, Charpentier, Arthur, Mussard, Stéphane

arXiv.org Artificial IntelligenceJan-29-2025

This paper introduces innovative enhancements to the K-means and K-nearest neighbors (KNN) algorithms based on the concept of Gini prametric spaces. Unlike traditional distance metrics, Gini-based measures incorporate both value-based and rank-based information, improving robustness to noise and outliers. The main contributions of this work include: proposing a Gini-based measure that captures both rank information and value distances; presenting a Gini K-means algorithm that is proven to converge and demonstrates resilience to noisy data; and introducing a Gini KNN method that performs competitively with state-of-the-art approaches such as Hassanat's distance in noisy environments. Experimental evaluations on 14 datasets from the UCI repository demonstrate the superior performance and efficiency of Gini-based algorithms in clustering and classification tasks. This work opens new avenues for leveraging rank-based measures in machine learning and statistical analysis.

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2501.18028

Country:

Oceania > Australia > Australian Capital Territory > Canberra (0.06)
North America > Canada > Quebec > Montreal (0.04)
North America > United States > Texas (0.04)
(4 more...)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.86)

Add feedback