Goto

Collaborating Authors

 Clustering


Online Clustering with Bandit Information

arXiv.org Artificial Intelligence

We study the problem of online clustering within the multi-armed bandit framework under the fixed confidence setting. In this multi-armed bandit problem, we have $M$ arms, each providing i.i.d. samples that follow a multivariate Gaussian distribution with an {\em unknown} mean and a known unit covariance. The arms are grouped into $K$ clusters based on the distance between their means using the Single Linkage (SLINK) clustering algorithm on the means of the arms. Since the true means are unknown, the objective is to obtain the above clustering of the arms with the minimum number of samples drawn from the arms, subject to an upper bound on the error probability. We introduce a novel algorithm, Average Tracking Bandit Online Clustering (ATBOC), and prove that this algorithm is order optimal, meaning that the upper bound on its expected sample complexity for given error probability $\delta$ is within a factor of 2 of an instance-dependent lower bound as $\delta \rightarrow 0$. Furthermore, we propose a computationally more efficient algorithm, Lower and Upper Confidence Bound-based Bandit Online Clustering (LUCBBOC), inspired by the LUCB algorithm for best arm identification. Simulation results demonstrate that the performance of LUCBBOC is comparable to that of ATBOC. We numerically assess the effectiveness of the proposed algorithms through numerical experiments on both synthetic datasets and the real-world MovieLens dataset. To the best of our knowledge, this is the first work on bandit online clustering that allows arms with different means in a cluster and $K$ greater than 2.


CoHiRF: A Scalable and Interpretable Clustering Framework for High-Dimensional Data

arXiv.org Machine Learning

Clustering high-dimensional data poses significant High-dimensional datasets suffer from the well-known challenges due to the curse of dimensionality, "curse of dimensionality." As the dimensionality p increases, scalability issues, and the presence of noisy and the relevant information often lies in a low-dimensional subspace, irrelevant features. We propose Consensus Hierarchical with the remaining dimensions contributing predominantly Random Feature (CoHiRF), a novel to noise. Consequently, data points tend to become clustering method designed to address these challenges equidistant in high-dimensional space, rendering traditional effectively. CoHiRF leverages random feature distance-based clustering algorithms, such as K-Means, less selection to mitigate noise and dimensionality effective (Beyer et al., 1999). Specifically, the Euclidean distance effects, repeatedly applies K-Means clustering metric loses its discriminative power, resulting in poor in reduced feature spaces, and combines results clustering performance. Another critical challenge is scalability: through a unanimous consensus criterion. This traditional clustering methods, originally designed iterative approach constructs a cluster assignment for low-dimensional or small datasets, often struggle with matrix, where each row records the cluster assignments high computational and memory demands when applied of a sample across repetitions, enabling the to high-dimensional data settings (Steinbach et al., 2004; identification of stable clusters by comparing identical Assent, 2012; Zimek et al., 2012; Mahdi et al., 2021).


Comparative Analysis of Community Detection Algorithms on the SNAP Social Circles Dataset

arXiv.org Artificial Intelligence

In network research, Community Detection has always been a topic of significant interest in network science, with numerous papers and algorithms proposing to uncover the underlying structures within networks. In this paper, we conduct a comparative analysis of several prominent community detection algorithms applied to the SNAP Social Circles Dataset, derived from the Facebook Social Media network. The algorithms implemented include Louvain, Girvan-Newman, Spectral Clustering, K-Means Clustering, etc. We evaluate the performance of these algorithms based on various metrics such as modularity, normalized cut-ratio, silhouette score, compactness, and separability. Our findings reveal insights into the effectiveness of each algorithm in detecting various meaningful communities within the social network, shedding light on their strength and limitations. This research contributes to the understanding of community detection methods and provides valuable guidance for their application in analyzing real-world social networks.


Explorations of the Softmax Space: Knowing When the Neural Network Doesn't Know...

arXiv.org Artificial Intelligence

Ensuring the reliability and safety of automated decision-making is crucial. This paper proposes a new approach for measuring the reliability of predictions in machine learning models. We analyze how the outputs of a trained neural network change using clustering to measure distances between outputs and class centroids. We propose this distance as a metric to evaluate the confidence of predictions. We assign each prediction to a cluster with centroid representing the mean softmax output for all correct predictions of a given class. We then define a safety threshold for a class as the smallest distance from an incorrect prediction to the given class centroid. We evaluate the approach on the MNIST and CIFAR-10 datasets using a Convolutional Neural Network and a Vision Transformer, respectively. The results show that our approach is consistent across these data sets and network models, and indicate that the proposed metric can offer an efficient way of determining when automated predictions are acceptable and when they should be deferred to human operators.


DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning

arXiv.org Artificial Intelligence

Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation. It provides high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL. Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both textual diversity and class diversity, ensuring a balanced data distribution. It further propagates uncertainty information via density-based clustering to select hard representative instances. DEUCE performs well in selecting class-balanced and hard representative data by dual-diversity and informativeness. Experiments on six NLP datasets demonstrate the superiority and efficiency of DEUCE.


Algorithmic Clustering based on String Compression to Extract P300 Structure in EEG Signals

arXiv.org Artificial Intelligence

P300 is an Event-Related Potential widely used in Brain-Computer Interfaces, but its detection is challenging due to inter-subject and temporal variability. This work introduces a clustering methodology based on Normalized Compression Distance (NCD) to extract the P300 structure, ensuring robustness against variability. We propose a novel signal-to-ASCII transformation to generate compression-friendly objects, which are then clustered using a hierarchical tree-based method and a multidimensional projection approach. Experimental results on two datasets demonstrate the method's ability to reveal relevant P300 structures, showing clustering performance comparable to state-of-the-art approaches. Furthermore, analysis at the electrode level suggests that the method could assist in electrode selection for P300 detection. This compression-driven clustering methodology offers a complementary tool for EEG analysis and P300 identification.


Clustering in hyperbolic balls

arXiv.org Artificial Intelligence

The idea of representations of the data in negatively curved manifolds recently attracted a lot of attention and gave a rise to the new research direction named {\it hyperbolic machine learning} (ML). In order to unveil the full potential of this new paradigm, efficient techniques for data analysis and statistical modeling in hyperbolic spaces are necessary. In the present paper rigorous mathematical framework for clustering in hyperbolic spaces is established. First, we introduce the $k$-means clustering in hyperbolic balls, based on the novel definition of barycenter. Second, we present the expectation-maximization (EM) algorithm for learning mixtures of novel probability distributions in hyperbolic balls. In such a way we lay the foundation of unsupervised learning in hyperbolic spaces.


Dual-Bounded Nonlinear Optimal Transport for Size Constrained Min Cut Clustering

arXiv.org Artificial Intelligence

Min cut is an important graph partitioning method. However, current solutions to the min cut problem suffer from slow speeds, difficulty in solving, and often converge to simple solutions. To address these issues, we relax the min cut problem into a dual-bounded constraint and, for the first time, treat the min cut problem as a dual-bounded nonlinear optimal transport problem. Additionally, we develop a method for solving dual-bounded nonlinear optimal transport based on the Frank-Wolfe method (abbreviated as DNF). Notably, DNF not only solves the size constrained min cut problem but is also applicable to all dual-bounded nonlinear optimal transport problems. We prove that for convex problems satisfying Lipschitz smoothness, the DNF method can achieve a convergence rate of \(\mathcal{O}(\frac{1}{t})\). We apply the DNF method to the min cut problem and find that it achieves state-of-the-art performance in terms of both the loss function and clustering accuracy at the fastest speed, with a convergence rate of \(\mathcal{O}(\frac{1}{\sqrt{t}})\). Moreover, the DNF method for the size constrained min cut problem requires no parameters and exhibits better stability.


ACTGNN: Assessment of Clustering Tendency with Synthetically-Trained Graph Neural Networks

arXiv.org Artificial Intelligence

Determining clustering tendency in datasets is a fundamental but challenging task, especially in noisy or high-dimensional settings where traditional methods, such as the Hopkins Statistic and Visual Assessment of Tendency (VAT), often struggle to produce reliable results. In this paper, we propose ACTGNN, a graph-based framework designed to assess clustering tendency by leveraging graph representations of data. Node features are constructed using Locality-Sensitive Hashing (LSH), which captures local neighborhood information, while edge features incorporate multiple similarity metrics, such as the Radial Basis Function (RBF) kernel, to model pairwise relationships. A Graph Neural Network (GNN) is trained exclusively on synthetic datasets, enabling robust learning of clustering structures under controlled conditions. Extensive experiments demonstrate that ACTGNN significantly outperforms baseline methods on both synthetic and real-world datasets, exhibiting superior performance in detecting faint clustering structures, even in high-dimensional or noisy data. Our results highlight the generalizability and effectiveness of the proposed approach, making it a promising tool for robust clustering tendency assessment.


KNN and K-means in Gini Prametric Spaces

arXiv.org Artificial Intelligence

This paper introduces innovative enhancements to the K-means and K-nearest neighbors (KNN) algorithms based on the concept of Gini prametric spaces. Unlike traditional distance metrics, Gini-based measures incorporate both value-based and rank-based information, improving robustness to noise and outliers. The main contributions of this work include: proposing a Gini-based measure that captures both rank information and value distances; presenting a Gini K-means algorithm that is proven to converge and demonstrates resilience to noisy data; and introducing a Gini KNN method that performs competitively with state-of-the-art approaches such as Hassanat's distance in noisy environments. Experimental evaluations on 14 datasets from the UCI repository demonstrate the superior performance and efficiency of Gini-based algorithms in clustering and classification tasks. This work opens new avenues for leveraging rank-based measures in machine learning and statistical analysis.