Goto

Collaborating Authors

 Clustering


AMD-DBSCAN: An Adaptive Multi-density DBSCAN for datasets of extremely variable density

arXiv.org Artificial Intelligence

DBSCAN has been widely used in density-based clustering algorithms. However, with the increasing demand for Multi-density clustering, previous traditional DSBCAN can not have good clustering results on Multi-density datasets. In order to address this problem, an adaptive Multi-density DBSCAN algorithm (AMD-DBSCAN) is proposed in this paper. An improved parameter adaptation method is proposed in AMD-DBSCAN to search for multiple parameter pairs (i.e., Eps and MinPts), which are the key parameters to determine the clustering results and performance, therefore allowing the model to be applied to Multi-density datasets. Moreover, only one hyperparameter is required for AMD-DBSCAN to avoid the complicated repetitive initialization operations. Furthermore, the variance of the number of neighbors (VNN) is proposed to measure the difference in density between each cluster. The experimental results show that our AMD-DBSCAN reduces execution time by an average of 75% due to lower algorithm complexity compared with the traditional adaptive algorithm. In addition, AMD-DBSCAN improves accuracy by 24.7% on average over the state-of-the-art design on Multi-density datasets of extremely variable density, while having no performance loss in Single-density scenarios. Our code and datasets are available at https://github.com/AlexandreWANG915/AMD-DBSCAN.


Transferable Deep Clustering Model

arXiv.org Artificial Intelligence

Deep learning has shown remarkable success in the field of clustering recently. However, how to transfer a trained clustering model on a source domain to a target domain by leveraging the acquired knowledge to guide the clustering process remains challenging. Existing deep clustering methods often lack generalizability to new domains because they typically learn a group of fixed cluster centroids, which may not be optimal for the new domain distributions. In this paper, we propose a novel transferable deep clustering model that can automatically adapt the cluster centroids according to the distribution of data samples. Rather than learning a fixed set of centroids, our approach introduces a novel attention-based module that can adapt the centroids by measuring their relationship with samples. In addition, we theoretically show that our model is strictly more powerful than some classical clustering algorithms such as k-means or Gaussian Mixture Model (GMM). Experimental results on both synthetic and real-world datasets demonstrate the effectiveness and efficiency of our proposed transfer learning framework, which significantly improves the performance on target domain and reduces the computational cost.


Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models

arXiv.org Artificial Intelligence

The advent of large pre-trained models has brought about a paradigm shift in both visual representation learning and natural language processing. However, clustering unlabeled images, as a fundamental and classic machine learning problem, still lacks an effective solution, particularly for large-scale datasets. In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representation of large pre-trained models such as CLIP and cluster images effectively and efficiently at scale. We first developed a novel algorithm to estimate the number of clusters in a given dataset. We then show that the pre-trained features are significantly more structured by further optimizing the rate reduction objective. The resulting features may significantly improve the clustering accuracy, e.g., from 57% to 66% on ImageNet-1k. Furthermore, by leveraging CLIP's multimodality bridge between image and text, we develop a simple yet effective self-labeling algorithm that produces meaningful text labels for the clusters. Through extensive experiments, we show that our pipeline works well on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet-1k. It also extends to datasets without predefined labels, such as LAION-Aesthetics and WikiArts. We released the code in https://github.com/LeslieTrue/CPP.


Clustering Categorical Data: Soft Rounding k-modes

arXiv.org Artificial Intelligence

Over the last three decades, researchers have intensively explored various clustering tools for categorical data analysis. Despite the proposal of various clustering algorithms, the classical k-modes algorithm remains a popular choice for unsupervised learning of categorical data. Surprisingly, our first insight is that in a natural generative block model, the k-modes algorithm performs poorly for a large range of parameters. We remedy this issue by proposing a soft rounding variant of the k-modes algorithm (SoftModes) and theoretically prove that our variant addresses the drawbacks of the k-modes algorithm in the generative model. Finally, we empirically verify that SoftModes performs well on both synthetic and real-world datasets.


Joint Projection Learning and Tensor Decomposition Based Incomplete Multi-view Clustering

arXiv.org Artificial Intelligence

Incomplete multi-view clustering (IMVC) has received increasing attention since it is often that some views of samples are incomplete in reality. Most existing methods learn similarity subgraphs from original incomplete multi-view data and seek complete graphs by exploring the incomplete subgraphs of each view for spectral clustering. However, the graphs constructed on the original high-dimensional data may be suboptimal due to feature redundancy and noise. Besides, previous methods generally ignored the graph noise caused by the inter-class and intra-class structure variation during the transformation of incomplete graphs and complete graphs. To address these problems, we propose a novel Joint Projection Learning and Tensor Decomposition Based method (JPLTD) for IMVC. Specifically, to alleviate the influence of redundant features and noise in high-dimensional data, JPLTD introduces an orthogonal projection matrix to project the high-dimensional features into a lower-dimensional space for compact feature learning.Meanwhile, based on the lower-dimensional space, the similarity graphs corresponding to instances of different views are learned, and JPLTD stacks these graphs into a third-order low-rank tensor to explore the high-order correlations across different views. We further consider the graph noise of projected data caused by missing samples and use a tensor-decomposition based graph filter for robust clustering.JPLTD decomposes the original tensor into an intrinsic tensor and a sparse tensor. The intrinsic tensor models the true data similarities. An effective optimization algorithm is adopted to solve the JPLTD model. Comprehensive experiments on several benchmark datasets demonstrate that JPLTD outperforms the state-of-the-art methods. The code of JPLTD is available at https://github.com/weilvNJU/JPLTD.


Robust convex biclustering with a tuning-free method

arXiv.org Machine Learning

Biclustering is widely used in different kinds of fields including gene information analysis, text mining, and recommendation system by effectively discovering the local correlation between samples and features. However, many biclustering algorithms will collapse when facing heavy-tailed data. In this paper, we propose a robust version of convex biclustering algorithm with Huber loss. Yet, the newly introduced robustification parameter brings an extra burden to selecting the optimal parameters. Therefore, we propose a tuning-free method for automatically selecting the optimal robustification parameter with high efficiency. The simulation study demonstrates the more fabulous performance of our proposed method than traditional biclustering methods when encountering heavy-tailed noise. A real-life biomedical application is also presented. The R package RcvxBiclustr is available at https://github.com/YifanChen3/RcvxBiclustr.


A Process for Topic Modelling Via Word Embeddings

arXiv.org Artificial Intelligence

This work combines algorithms based on word embeddings, dimensionality reduction, and clustering. The objective is to obtain topics from a set of unclassified texts. The algorithm to obtain the word embeddings is the BERT model, a neural network architecture widely used in NLP tasks. Due to the high dimensionality, a dimensionality reduction technique called UMAP is used. This method manages to reduce the dimensions while preserving part of the local and global information of the original data. K-Means is used as the clustering algorithm to obtain the topics. Then, the topics are evaluated using the TF-IDF statistics, Topic Diversity, and Topic Coherence to get the meaning of the words on the clusters. The results of the process show good values, so the topic modeling of this process is a viable option for classifying or clustering texts without labels.


Universal Detection of Backdoor Attacks via Density-based Clustering and Centroids Analysis

arXiv.org Artificial Intelligence

We propose a Universal Defence against backdoor attacks based on Clustering and Centroids Analysis (CCA-UD). The goal of the defence is to reveal whether a Deep Neural Network model is subject to a backdoor attack by inspecting the training dataset. CCA-UD first clusters the samples of the training set by means of density-based clustering. Then, it applies a novel strategy to detect the presence of poisoned clusters. The proposed strategy is based on a general misclassification behaviour observed when the features of a representative example of the analysed cluster are added to benign samples. The capability of inducing a misclassification error is a general characteristic of poisoned samples, hence the proposed defence is attack-agnostic. This marks a significant difference with respect to existing defences, that, either can defend against only some types of backdoor attacks, or are effective only when some conditions on the poisoning ratio or the kind of triggering signal used by the attacker are satisfied. Experiments carried out on several classification tasks and network architectures, considering different types of backdoor attacks (with either clean or corrupted labels), and triggering signals, including both global and local triggering signals, as well as sample-specific and source-specific triggers, reveal that the proposed method is very effective to defend against backdoor attacks in all the cases, always outperforming the state of the art techniques.


Is Simple Uniform Sampling Effective for Center-Based Clustering with Outliers: When and Why?

arXiv.org Artificial Intelligence

Real-world datasets often contain outliers, and the presence of outliers can make the clustering problems to be much more challenging. In this paper, we propose a simple uniform sampling framework for solving three representative center-based clustering with outliers problems: $k$-center/median/means clustering with outliers. Our analysis is fundamentally different from the previous (uniform and non-uniform) sampling based ideas. To explain the effectiveness of uniform sampling in theory, we introduce a measure of "significance" and prove that the performance of our framework depends on the significance degree of the given instance. In particular, the sample size can be independent of the input data size $n$ and the dimensionality $d$, if we assume the given instance is "significant", which is in fact a fairly reasonable assumption in practice. Due to its simplicity, the uniform sampling approach also enjoys several significant advantages over the non-uniform sampling approaches in practice. To the best of our knowledge, this is the first work that systematically studies the effectiveness of uniform sampling from both theoretical and experimental aspects.


Pool-Based Active Learning with Proper Topological Regions

arXiv.org Artificial Intelligence

In recent years, machine learning has found gainful applications in diverse domains, but it still has a heavy dependence on expensive labeled data: Advances in cheap computing and storage have made it easier to store and process large amounts of unlabeled data, but the labeling often needs to be done by humans or using costly tools. Therefore, there is a need to develop general domain-independent methods to learn models effectively from a large amount of unlabeled data at the disposal, along with a minimal amount of labeled data: this is the framework of semi-supervised learning. Active learning aims explicitly to detect the observations to be labeled to optimize the learning process and efficiently reduce the labeling cost. The primary assumption behind active learning is that machine learning algorithms could reach a higher level of performance while using a smaller number of training labels if they were allowed to choose the training dataset (Settles, 2009). The most common active learning approaches are pool-based methods (Lewis and Catlett, 1994) based on a set of unlabeled observations. First, some points are labeled to train a classification model, and then, at each iteration, we choose unlabeled examples to query based on the predictions of the current model and a predefined priority score. These approaches show their limitations in low-budget regime scenarios because they need a sufficient budget to learn a weak model (Pourahmadi et al., 2021). 1