AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Power Spectrum Signatures of Graphs

Djima, Karamatou Yacoubou, Yim, Ka Man

arXiv.org Machine LearningMar-14-2025

Point signatures based on the Laplacian operators on graphs, point clouds, and manifolds have become popular tools in machine learning for graphs, clustering, and shape analysis. In this work, we propose a novel point signature, the power spectrum signature, a measure on $\mathbb{R}$ defined as the squared graph Fourier transform of a graph signal. Unlike eigenvectors of the Laplacian from which it is derived, the power spectrum signature is invariant under graph automorphisms. We show that the power spectrum signature is stable under perturbations of the input graph with respect to the Wasserstein metric. We focus on the signature applied to classes of indicator functions, and its applications to generating descriptive features for vertices of graphs. To demonstrate the practical value of our signature, we showcase several applications in characterizing geometry and symmetries in point cloud data, and graph regression problems.

artificial intelligence, machine learning, signature, (13 more...)

arXiv.org Machine Learning

2503.0966

Country:

North America > United States (0.14)
Europe > United Kingdom (0.04)

Genre:

Overview (0.92)
Research Report (0.81)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

MUSS: Multilevel Subset Selection for Relevance and Diversity

Nguyen, Vu, Kan, Andrey

arXiv.org Artificial IntelligenceMar-14-2025

The problem of relevant and diverse subset selection has a wide range of applications, including recommender systems and retrieval-augmented generation (RAG). For example, in recommender systems, one is interested in selecting relevant items, while providing a diversified recommendation. Constrained subset selection problem is NP-hard, and popular approaches such as Maximum Marginal Relevance (MMR) are based on greedy selection. Many real-world applications involve large data, but the original MMR work did not consider distributed selection. This limitation was later addressed by a method called DGDS which allows for a distributed setting using random data partitioning. Here, we exploit structure in the data to further improve both scalability and performance on the target application. We propose MUSS, a novel method that uses a multilevel approach to relevant and diverse selection. We provide a rigorous theoretical analysis and show that our method achieves a constant factor approximation of the optimal objective. In a recommender system application, our method can achieve the same level of performance as baselines, but 4.5 to 20 times faster. Our method is also capable of outperforming baselines by up to 6 percent points of RAG-based question answering accuracy.

multilevel subset selection, muss, selection, (13 more...)

arXiv.org Artificial Intelligence

2503.11126

Country: Asia > Middle East > Iran > Tehran Province > Tehran (0.04)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Deep Incomplete Multi-view Clustering with Distribution Dual-Consistency Recovery Guidance

Jin, Jiaqi, Wang, Siwei, Dong, Zhibin, Yang, Xihong, Liu, Xinwang, Zhu, En, He, Kunlun

arXiv.org Artificial IntelligenceMar-13-2025

Multi-view clustering leverages complementary representations from diverse sources to enhance performance. However, real-world data often suffer incomplete cases due to factors like privacy concerns and device malfunctions. A key challenge is effectively utilizing available instances to recover missing views. Existing methods frequently overlook the heterogeneity among views during recovery, leading to significant distribution discrepancies between recovered and true data. Additionally, many approaches focus on cross-view correlations, neglecting insights from intra-view reliable structure and cross-view clustering structure. To address these issues, we propose BURG, a novel method for incomplete multi-view clustering with distriBution dUal-consistency Recovery Guidance. We treat each sample as a distinct category and perform cross-view distribution transfer to predict the distribution space of missing views. To compensate for the lack of reliable category information, we design a dual-consistency guided recovery strategy that includes intra-view alignment guided by neighbor-aware consistency and cross-view alignment guided by prototypical consistency. Extensive experiments on benchmarks demonstrate the superiority of BURG in the incomplete multi-view scenario.

dataset, ieee transaction, representation, (15 more...)

arXiv.org Artificial Intelligence

2503.11017

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Hunan Province > Changsha (0.04)

Genre: Research Report > Promising Solution (0.48)

Industry: Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Reconsidering Feature Structure Information and Latent Space Alignment in Partial Multi-label Feature Selection

Pan, Hanlin, Liu, Kunpeng, Gao, Wanfu

arXiv.org Artificial IntelligenceMar-13-2025

The purpose of partial multi-label feature selection is to select the most representative feature subset, where the data comes from partial multi-label datasets that have label ambiguity issues. For label disambiguation, previous methods mainly focus on utilizing the information inside the labels and the relationship between the labels and features. However, the information existing in the feature space is rarely considered, especially in partial multi-label scenarios where the noises is considered to be concentrated in the label space while the feature information is correct. This paper proposes a method based on latent space alignment, which uses the information mined in feature space to disambiguate in latent space through the structural consistency between labels and features. In addition, previous methods overestimate the consistency of features and labels in the latent space after convergence. We comprehensively consider the similarity of latent space projections to feature space and label space, and propose new feature selection term. This method also significantly improves the positive label identification ability of the selected features. Comprehensive experiments demonstrate the superiority of the proposed method.

feature selection, matrix, selection, (14 more...)

arXiv.org Artificial Intelligence

2503.10115

Country:

Asia > China (0.05)
North America > United States > Oregon > Multnomah County > Portland (0.04)
Asia > South Korea (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Clustering by Nonparametric Smoothing

Hofmeyr, David P.

arXiv.org Machine LearningMar-12-2025

A novel formulation of the clustering problem is introduced in which the task is expressed as an estimation problem, where the object to be estimated is a function which maps a point to its distribution of cluster membership. Unlike existing approaches which implicitly estimate such a function, like Gaussian Mixture Models (GMMs), the proposed approach bypasses any explicit modelling assumptions and exploits the flexible estimation potential of nonparametric smoothing. An intuitive approach for selecting the tuning parameters governing estimation is provided, which allows the proposed method to automatically determine both an appropriate level of flexibility and also the number of clusters to extract from a given data set. Experiments on a large collection of publicly available data sets are used to document the strong performance of the proposed approach, in comparison with relevant benchmarks from the literature. R code to implement the proposed approach is available from https://github.com/DavidHofmeyr/ I. Introduction Cluster analysis refers to the task of partitioning a set of data into groups (or clusters) in such a way that points within the same cluster tend to be more similar than points in different clusters.

artificial intelligence, formulation, machine learning, (18 more...)

arXiv.org Machine Learning

2503.09134

Country:

Europe > Austria > Vienna (0.14)
Asia (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Refining Filter Global Feature Weighting for Fully-Unsupervised Clustering

Galis, Fabian, Onchis, Darian

arXiv.org Artificial IntelligenceMar-12-2025

In the context of unsupervised learning, effective clustering plays a vital role in revealing patterns and insights from unlabeled data. However, the success of clustering algorithms often depends on the relevance and contribution of features, which can differ between various datasets. This paper explores feature weighting for clustering and presents new weighting strategies, including methods based on SHAP (SHapley Additive exPlanations), a technique commonly used for providing explainability in various supervised machine learning tasks. By taking advantage of SHAP values in a way other than just to gain explainability, we use them to weight features and ultimately improve the clustering process itself in unsupervised scenarios. Our empirical evaluations across five benchmark datasets and clustering methods demonstrate that feature weighting based on SHAP can enhance unsupervised clustering quality, achieving up to a 22.69\% improvement over other weighting methods (from 0.586 to 0.719 in terms of the Adjusted Rand Index). Additionally, these situations where the weighted data boosts the results are highlighted and thoroughly explored, offering insight for practical applications.

algorithm, dataset, shap, (15 more...)

arXiv.org Artificial Intelligence

2503.11706

Country:

North America > United States > Wisconsin (0.04)
South America > Paraguay > Asunción > Asunción (0.04)
Oceania > Australia > Queensland (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area (0.30)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

FedMSGL: A Self-Expressive Hypergraph Based Federated Multi-View Learning

Li, Daoyuan, Yang, Zuyuan, Xie, Shengli

arXiv.org Artificial IntelligenceMar-12-2025

Federated learning is essential for enabling collaborative model training across decentralized data sources while preserving data privacy and security. This approach mitigates the risks associated with centralized data collection and addresses concerns related to data ownership and compliance. Despite significant advancements in federated learning algorithms that address communication bottlenecks and enhance privacy protection, existing works overlook the impact of differences in data feature dimensions, resulting in global models that disproportionately depend on participants with large feature dimensions. Additionally, current single-view federated learning methods fail to account for the unique characteristics of multi-view data, leading to suboptimal performance in processing such data. To address these issues, we propose a Self-expressive Hypergraph Based Federated Multi-view Learning method (FedMSGL). The proposed method leverages self-expressive character in the local training to learn uniform dimension subspace with latent sample relation. At the central side, an adaptive fusion technique is employed to generate the global model, while constructing a hypergraph from the learned global and view-specific subspace to capture intricate interconnections across views. Experiments on multi-view datasets with different feature dimensions validated the effectiveness of the proposed method.

federated learning, learning, subspace, (14 more...)

arXiv.org Artificial Intelligence

2503.09643

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > China (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.66)

Add feedback

Double-Stage Feature-Level Clustering-Based Mixture of Experts Framework

Badjie, Bakary, Cecílio, José, Casimiro, António

arXiv.org Artificial IntelligenceMar-12-2025

The Mixture-of-Experts (MoE) model has succeeded in deep learning (DL). However, its complex architecture and advantages over dense models in image classification remain unclear. In previous studies, MoE performance has often been affected by noise and outliers in the input space. Some approaches incorporate input clustering for training MoE models, but most clustering algorithms lack access to labeled data, limiting their effectiveness. This paper introduces the Double-stage Feature-level Clustering and Pseudo-labeling-based Mixture of Experts (DFCP-MoE) framework, which consists of input feature extraction, feature-level clustering, and a computationally efficient pseudo-labeling strategy. This approach reduces the impact of noise and outliers while leveraging a small subset of labeled data to label a large portion of unlabeled inputs. We propose a conditional end-to-end joint training method that improves expert specialization by training the MoE model on well-labeled, clustered inputs. Unlike traditional MoE and dense models, the DFCP-MoE framework effectively captures input space diversity, leading to competitive inference results. We validate our approach on three benchmark datasets for multi-class classification tasks.

dense model, dfcp-moe framework, expert model, (15 more...)

arXiv.org Artificial Intelligence

2503.09504

Country:

Asia > Middle East > Jordan (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Add feedback

Neural Normalized Cut: A Differential and Generalizable Approach for Spectral Clustering

He, Wei, Zhang, Shangzhi, Li, Chun-Guang, Qi, Xianbiao, Xiao, Rong, Guo, Jun

arXiv.org Artificial IntelligenceMar-12-2025

Spectral clustering, as a popular tool for data clustering, requires an eigen-decomposition step on a given affinity to obtain the spectral embedding. Nevertheless, such a step suffers from the lack of generalizability and scalability. Moreover, the obtained spectral embeddings can hardly provide a good approximation to the ground-truth partition and thus a k-means step is adopted to quantize the embedding. In this paper, we propose a simple yet effective scalable and generalizable approach, called Neural Normalized Cut (NeuNcut), to learn the clustering membership for spectral clustering directly. In NeuNcut, we properly reparameterize the unknown cluster membership via a neural network, and train the neural network via stochastic gradient descent with a properly relaxed normalized cut loss. As a result, our NeuNcut enjoys a desired generalization ability to directly infer clustering membership for out-of-sample unseen data and hence brings us an efficient way to handle clustering task with ultra large-scale data. We conduct extensive experiments on both synthetic data and benchmark datasets and experimental results validate the effectiveness and the superiority of our approach. Our code is available at: https://github.com/hewei98/NeuNcut.

affinity, neuncut, spectral, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.patcog.2025.111545

2503.0926

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > China > Beijing > Beijing (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Freeze and Cluster: A Simple Baseline for Rehearsal-Free Continual Category Discovery

Zhang, Chuyu, Yu, Xueyang, Gu, Peiyan, He, Xuming

arXiv.org Artificial IntelligenceMar-12-2025

This paper addresses the problem of Rehearsal-Free Continual Category Discovery (RF-CCD), which focuses on continuously identifying novel class by leveraging knowledge from labeled data. Existing methods typically train from scratch, overlooking the potential of base models, and often resort to data storage to prevent forgetting. Moreover, because RF-CCD encompasses both continual learning and novel class discovery, previous approaches have struggled to effectively integrate advanced techniques from these fields, resulting in less convincing comparisons and failing to reveal the unique challenges posed by RF-CCD. To address these challenges, we lead the way in integrating advancements from both domains and conducting extensive experiments and analyses. Our findings demonstrate that this integration can achieve state-of-the-art results, leading to the conclusion that "in the presence of pre-trained models, the representation does not improve and may even degrade with the introduction of unlabeled data." To mitigate representation degradation, we propose a straightforward yet highly effective baseline method. This method first utilizes prior knowledge of known categories to estimate the number of novel classes. It then acquires representations using a model specifically trained on the base classes, generates high-quality pseudo-labels through k-means clustering, and trains only the classifier layer. The results clearly illustrate our findings, demonstrate the effectiveness of our baseline, and pave the way for future advancements in RF-CCD. Humans possess the ability to continuously learn new knowledge in ever-changing environments with limited supervision. Inspired by this capability, several studies have proposed the problem of continual novel class discovery (Roy et al., 2022; Joseph et al., 2022), aiming to enable models to continuously capture new categories from unlabeled data.

discovery, learning, novel class, (15 more...)

arXiv.org Artificial Intelligence

2503.09106

Country: Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback