AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

AMD-DBSCAN: An Adaptive Multi-density DBSCAN for datasets of extremely variable density

Wang, Ziqing, Ye, Zhirong, Du, Yuyang, Mao, Yi, Liu, Yanying, Wu, Ziling, Wang, Jun

arXiv.org Artificial IntelligenceOct-9-2023

DBSCAN has been widely used in density-based clustering algorithms. However, with the increasing demand for Multi-density clustering, previous traditional DSBCAN can not have good clustering results on Multi-density datasets. In order to address this problem, an adaptive Multi-density DBSCAN algorithm (AMD-DBSCAN) is proposed in this paper. An improved parameter adaptation method is proposed in AMD-DBSCAN to search for multiple parameter pairs (i.e., Eps and MinPts), which are the key parameters to determine the clustering results and performance, therefore allowing the model to be applied to Multi-density datasets. Moreover, only one hyperparameter is required for AMD-DBSCAN to avoid the complicated repetitive initialization operations. Furthermore, the variance of the number of neighbors (VNN) is proposed to measure the difference in density between each cluster. The experimental results show that our AMD-DBSCAN reduces execution time by an average of 75% due to lower algorithm complexity compared with the traditional adaptive algorithm. In addition, AMD-DBSCAN improves accuracy by 24.7% on average over the state-of-the-art design on Multi-density datasets of extremely variable density, while having no performance loss in Single-density scenarios. Our code and datasets are available at https://github.com/AlexandreWANG915/AMD-DBSCAN.

algorithm, dataset, dbscan, (15 more...)

arXiv.org Artificial Intelligence

2210.08162

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > China > Guangdong Province > Zhuhai (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(2 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Transferable Deep Clustering Model

Zhang, Zheng, Zhao, Liang

arXiv.org Artificial IntelligenceOct-7-2023

Deep learning has shown remarkable success in the field of clustering recently. However, how to transfer a trained clustering model on a source domain to a target domain by leveraging the acquired knowledge to guide the clustering process remains challenging. Existing deep clustering methods often lack generalizability to new domains because they typically learn a group of fixed cluster centroids, which may not be optimal for the new domain distributions. In this paper, we propose a novel transferable deep clustering model that can automatically adapt the cluster centroids according to the distribution of data samples. Rather than learning a fixed set of centroids, our approach introduces a novel attention-based module that can adapt the centroids by measuring their relationship with samples. In addition, we theoretically show that our model is strictly more powerful than some classical clustering algorithms such as k-means or Gaussian Mixture Model (GMM). Experimental results on both synthetic and real-world datasets demonstrate the effectiveness and efficiency of our proposed transfer learning framework, which significantly improves the performance on target domain and reduces the computational cost.

transferable deep clustering model

arXiv.org Artificial Intelligence

2310.04946

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.53)

Add feedback

Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models

Chu, Tianzhe, Tong, Shengbang, Ding, Tianjiao, Dai, Xili, Haeffele, Benjamin David, Vidal, René, Ma, Yi

arXiv.org Artificial IntelligenceOct-7-2023

The advent of large pre-trained models has brought about a paradigm shift in both visual representation learning and natural language processing. However, clustering unlabeled images, as a fundamental and classic machine learning problem, still lacks an effective solution, particularly for large-scale datasets. In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representation of large pre-trained models such as CLIP and cluster images effectively and efficiently at scale. We first developed a novel algorithm to estimate the number of clusters in a given dataset. We then show that the pre-trained features are significantly more structured by further optimizing the rate reduction objective. The resulting features may significantly improve the clustering accuracy, e.g., from 57% to 66% on ImageNet-1k. Furthermore, by leveraging CLIP's multimodality bridge between image and text, we develop a simple yet effective self-labeling algorithm that produces meaningful text labels for the clusters. Through extensive experiments, we show that our pipeline works well on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet-1k. It also extends to datasets without predefined labels, such as LAION-Aesthetics and WikiArts. We released the code in https://github.com/LeslieTrue/CPP.

arxiv preprint arxiv, dataset, representation, (14 more...)

arXiv.org Artificial Intelligence

2306.05272

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Pennsylvania (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry:

Education (0.66)
Transportation (0.47)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.51)

Add feedback

Clustering Categorical Data: Soft Rounding k-modes

Gavva, Surya Teja, S., Karthik C., Punna, Sharath

arXiv.org Artificial IntelligenceOct-7-2023

Over the last three decades, researchers have intensively explored various clustering tools for categorical data analysis. Despite the proposal of various clustering algorithms, the classical k-modes algorithm remains a popular choice for unsupervised learning of categorical data. Surprisingly, our first insight is that in a natural generative block model, the k-modes algorithm performs poorly for a large range of parameters. We remedy this issue by proposing a soft rounding variant of the k-modes algorithm (SoftModes) and theoretically prove that our variant addresses the drawbacks of the k-modes algorithm in the generative model. Finally, we empirically verify that SoftModes performs well on both synthetic and real-world datasets.

algorithm, k-mode algorithm, softmode, (15 more...)

arXiv.org Artificial Intelligence

2210.0964

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Virginia > Fairfax County > McLean (0.04)
(4 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Joint Projection Learning and Tensor Decomposition Based Incomplete Multi-view Clustering

Lv, Wei, Zhang, Chao, Li, Huaxiong, Jia, Xiuyi, Chen, Chunlin

arXiv.org Artificial IntelligenceOct-6-2023

Incomplete multi-view clustering (IMVC) has received increasing attention since it is often that some views of samples are incomplete in reality. Most existing methods learn similarity subgraphs from original incomplete multi-view data and seek complete graphs by exploring the incomplete subgraphs of each view for spectral clustering. However, the graphs constructed on the original high-dimensional data may be suboptimal due to feature redundancy and noise. Besides, previous methods generally ignored the graph noise caused by the inter-class and intra-class structure variation during the transformation of incomplete graphs and complete graphs. To address these problems, we propose a novel Joint Projection Learning and Tensor Decomposition Based method (JPLTD) for IMVC. Specifically, to alleviate the influence of redundant features and noise in high-dimensional data, JPLTD introduces an orthogonal projection matrix to project the high-dimensional features into a lower-dimensional space for compact feature learning.Meanwhile, based on the lower-dimensional space, the similarity graphs corresponding to instances of different views are learned, and JPLTD stacks these graphs into a third-order low-rank tensor to explore the high-order correlations across different views. We further consider the graph noise of projected data caused by missing samples and use a tensor-decomposition based graph filter for robust clustering.JPLTD decomposes the original tensor into an intrinsic tensor and a sparse tensor. The intrinsic tensor models the true data similarities. An effective optimization algorithm is adopted to solve the JPLTD model. Comprehensive experiments on several benchmark datasets demonstrate that JPLTD outperforms the state-of-the-art methods. The code of JPLTD is available at https://github.com/weilvNJU/JPLTD.

graph, ieee trans, tensor, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TNNLS.2023.3306006

2310.04038

Country: Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Robust convex biclustering with a tuning-free method

Chen, Yifan, Lei, Chunyin, Li, Chuanquan, Ma, Haiqiang, Hu, Ningyuan

arXiv.org Machine LearningOct-6-2023

Biclustering is widely used in different kinds of fields including gene information analysis, text mining, and recommendation system by effectively discovering the local correlation between samples and features. However, many biclustering algorithms will collapse when facing heavy-tailed data. In this paper, we propose a robust version of convex biclustering algorithm with Huber loss. Yet, the newly introduced robustification parameter brings an extra burden to selecting the optimal parameters. Therefore, we propose a tuning-free method for automatically selecting the optimal robustification parameter with high efficiency. The simulation study demonstrates the more fabulous performance of our proposed method than traditional biclustering methods when encountering heavy-tailed noise. A real-life biomedical application is also presented. The R package RcvxBiclustr is available at https://github.com/YifanChen3/RcvxBiclustr.

algorithm, convex, matrix, (15 more...)

arXiv.org Machine Learning

2212.03122

Country:

North America > United States > California > Santa Barbara County > Santa Barbara (0.14)
Europe > United Kingdom (0.14)
Asia > China > Jiangxi Province > Nanchang (0.04)

Genre: Research Report (0.40)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

A Process for Topic Modelling Via Word Embeddings

Ulloa, Diego Saldaña

arXiv.org Artificial IntelligenceOct-6-2023

This work combines algorithms based on word embeddings, dimensionality reduction, and clustering. The objective is to obtain topics from a set of unclassified texts. The algorithm to obtain the word embeddings is the BERT model, a neural network architecture widely used in NLP tasks. Due to the high dimensionality, a dimensionality reduction technique called UMAP is used. This method manages to reduce the dimensions while preserving part of the local and global information of the original data. K-Means is used as the clustering algorithm to obtain the topics. Then, the topics are evaluated using the TF-IDF statistics, Topic Diversity, and Topic Coherence to get the meaning of the words on the clusters. The results of the process show good values, so the topic modeling of this process is a viable option for classifying or clustering texts without labels.

algorithm, information, topic modeling, (12 more...)

arXiv.org Artificial Intelligence

2312.03705

Country:

North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > Mexico > Puebla > Puebla (0.04)
North America > Canada (0.04)

Genre: Research Report (0.50)

Industry: Law > Statutes (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Universal Detection of Backdoor Attacks via Density-based Clustering and Centroids Analysis

Guo, Wei, Tondi, Benedetta, Barni, Mauro

arXiv.org Artificial IntelligenceOct-5-2023

We propose a Universal Defence against backdoor attacks based on Clustering and Centroids Analysis (CCA-UD). The goal of the defence is to reveal whether a Deep Neural Network model is subject to a backdoor attack by inspecting the training dataset. CCA-UD first clusters the samples of the training set by means of density-based clustering. Then, it applies a novel strategy to detect the presence of poisoned clusters. The proposed strategy is based on a general misclassification behaviour observed when the features of a representative example of the analysed cluster are added to benign samples. The capability of inducing a misclassification error is a general characteristic of poisoned samples, hence the proposed defence is attack-agnostic. This marks a significant difference with respect to existing defences, that, either can defend against only some types of backdoor attacks, or are effective only when some conditions on the poisoning ratio or the kind of triggering signal used by the attacker are satisfied. Experiments carried out on several classification tasks and network architectures, considering different types of backdoor attacks (with either clean or corrupted labels), and triggering signals, including both global and local triggering signals, as well as sample-specific and source-specific triggers, reveal that the proposed method is very effective to defend against backdoor attacks in all the cases, always outperforming the state of the art techniques.

backdoor attack, cca-ud, dataset, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TIFS.2023.3329426

2301.04554

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Italy (0.04)
Asia > China (0.04)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Is Simple Uniform Sampling Effective for Center-Based Clustering with Outliers: When and Why?

Huang, Jiawei, Liu, Wenjie, Ding, Hu

arXiv.org Artificial IntelligenceOct-3-2023

Real-world datasets often contain outliers, and the presence of outliers can make the clustering problems to be much more challenging. In this paper, we propose a simple uniform sampling framework for solving three representative center-based clustering with outliers problems: $k$-center/median/means clustering with outliers. Our analysis is fundamentally different from the previous (uniform and non-uniform) sampling based ideas. To explain the effectiveness of uniform sampling in theory, we introduce a measure of "significance" and prove that the performance of our framework depends on the significance degree of the given instance. In particular, the sample size can be independent of the input data size $n$ and the dimensionality $d$, if we assume the given instance is "significant", which is in fact a fairly reasonable assumption in practice. Due to its simplicity, the uniform sampling approach also enjoys several significant advantages over the non-uniform sampling approaches in practice. To the best of our knowledge, this is the first work that systematically studies the effectiveness of uniform sampling from both theoretical and experimental aspects.

algorithm, outlier, probability, (15 more...)

arXiv.org Artificial Intelligence

2103.00558

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Illinois (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Pool-Based Active Learning with Proper Topological Regions

Hadjadj, Lies, Devijver, Emilie, Molinier, Remi, Amini, Massih-Reza

arXiv.org Artificial IntelligenceOct-2-2023

In recent years, machine learning has found gainful applications in diverse domains, but it still has a heavy dependence on expensive labeled data: Advances in cheap computing and storage have made it easier to store and process large amounts of unlabeled data, but the labeling often needs to be done by humans or using costly tools. Therefore, there is a need to develop general domain-independent methods to learn models effectively from a large amount of unlabeled data at the disposal, along with a minimal amount of labeled data: this is the framework of semi-supervised learning. Active learning aims explicitly to detect the observations to be labeled to optimize the learning process and efficiently reduce the labeling cost. The primary assumption behind active learning is that machine learning algorithms could reach a higher level of performance while using a smaller number of training labels if they were allowed to choose the training dataset (Settles, 2009). The most common active learning approaches are pool-based methods (Lewis and Catlett, 1994) based on a set of unlabeled observations. First, some points are labeled to train a classification model, and then, at each iteration, we choose unlabeled examples to query based on the predictions of the current model and a predefined priority score. These approaches show their limitations in low-budget regime scenarios because they need a sufficient budget to learn a weak model (Pourahmadi et al., 2021). 1

graph, learning, topological region, (12 more...)

arXiv.org Artificial Intelligence

2310.01597

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > New Jersey > Mercer County > Princeton (0.14)
(7 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback