AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Deep Clustering via Probabilistic Ratio-Cut Optimization

Ghriss, Ayoub, Monteleoni, Claire

arXiv.org Artificial IntelligenceFeb-5-2025

We propose a novel approach for optimizing the graph ratio-cut by modeling the binary assignments as random variables. We provide an upper bound on the expected ratio-cut, as well as an unbiased estimate of its gradient, to learn the parameters of the assignment variables in an online setting. The clustering resulting from our probabilistic approach (PRCut) outperforms the Rayleigh quotient relaxation of the combinatorial problem, its online learning extensions, and several widely used methods. We demonstrate that the PRCut clustering closely aligns with the similarity measure and can perform as well as a supervised classifier when label-based similarities are provided. This novel approach can leverage out-of-the-box self-supervised representations to achieve competitive performance and serve as an evaluation method for the quality of these representations.

data mining, lemma 4, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2502.03405

Country:

North America > United States > Colorado > Boulder County > Boulder (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > Thailand (0.04)

Genre: Research Report > Promising Solution (0.54)

Industry: Education (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)

Add feedback

Interaction-Aware Gaussian Weighting for Clustered Federated Learning

Licciardi, Alessandro, Leo, Davide, Faní, Eros, Caputo, Barbara, Ciccone, Marco

arXiv.org Artificial IntelligenceFeb-5-2025

Federated Learning (FL) emerged as a decentralized paradigm to train models while preserving privacy. However, conventional FL struggles with data heterogeneity and class imbalance, which degrade model performance. Clustered FL balances personalization and decentralized training by grouping clients with analogous data distributions, enabling improved accuracy while adhering to privacy constraints. This approach effectively mitigates the adverse impact of heterogeneity in FL. In this work, we propose a novel clustered FL method, FedGWC (Federated Gaussian Weighting Clustering), which groups clients based on their data distribution, allowing training of a more robust and personalized model on the identified clusters. FedGWC identifies homogeneous clusters by transforming individual empirical losses to model client interactions with a Gaussian reward mechanism. Additionally, we introduce the Wasserstein Adjusted Score, a new clustering metric for FL to evaluate cluster cohesion with respect to the individual class distribution. Our experiments on benchmark datasets show that FedGWC outperforms existing FL algorithms in cluster quality and classification accuracy, validating the efficacy of our approach.

fedgwc, interaction-aware gaussian weighting, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2502.0334

Country:

Europe > Italy > Piedmont > Turin Province > Turin (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(3 more...)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Data denoising with self consistency, variance maximization, and the Kantorovich dominance

Hiew, Joshua Zoen-Git, Lim, Tongseok, Pass, Brendan, de Souza, Marcelo Cruz

arXiv.org Artificial IntelligenceFeb-5-2025

We introduce a new framework for data denoising, partially inspired by martingale optimal transport. For a given noisy distribution (the data), our approach involves finding the closest distribution to it among all distributions which 1) have a particular prescribed structure (expressed by requiring they lie in a particular domain), and 2) are self-consistent with the data. We show that this amounts to maximizing the variance among measures in the domain which are dominated in convex order by the data. For particular choices of the domain, this problem and a relaxed version of it, in which the self-consistency condition is removed, are intimately related to various classical approaches to denoising. We prove that our general problem has certain desirable features: solutions exist under mild assumptions, have certain robustness properties, and, for very simple domains, coincide with solutions to the relaxed problem. We also introduce a novel relationship between distributions, termed Kantorovich dominance, which retains certain aspects of the convex order while being a weaker, more robust, and easier-to-verify condition. Building on this, we propose and analyze a new denoising problem by substituting the convex order in the previously described framework with Kantorovich dominance. We demonstrate that this revised problem shares some characteristics with the full convex order problem but offers enhanced stability, greater computational efficiency, and, in specific domains, more meaningful solutions. Finally, we present simple numerical examples illustrating solutions for both the full convex order problem and the Kantorovich dominance problem.

artificial intelligence, machine learning, var, (16 more...)

arXiv.org Artificial Intelligence

2502.02925

Country:

South America > Brazil > Ceará > Fortaleza (0.04)
North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
(2 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Minimax-Optimal Dimension-Reduced Clustering for High-Dimensional Nonspherical Mixtures

Huang, Chengzhu, Gu, Yuqi

arXiv.org Machine LearningFeb-4-2025

In mixture models, nonspherical (anisotropic) noise within each cluster is widely present in real-world data. We study both the minimax rate and optimal statistical procedure for clustering under high-dimensional nonspherical mixture models. In high-dimensional settings, we first establish the information-theoretic limits for clustering under Gaussian mixtures. The minimax lower bound unveils an intriguing informational dimension-reduction phenomenon: there exists a substantial gap between the minimax rate and the oracle clustering risk, with the former determined solely by the projected centers and projected covariance matrices in a low-dimensional space. Motivated by the lower bound, we propose a novel computationally efficient clustering method: Covariance Projected Spectral Clustering (COPO). Its key step is to project the high-dimensional data onto the low-dimensional space spanned by the cluster centers and then use the projected covariance matrices in this space to enhance clustering. We establish tight algorithmic upper bounds for COPO, both for Gaussian noise with flexible covariance and general noise with local dependence. Our theory indicates the minimax-optimality of COPO in the Gaussian case and highlights its adaptivity to a broad spectrum of dependent noise. Extensive simulation studies under various noise structures and real data analysis demonstrate our method's superior performance.

artificial intelligence, machine learning, probability, (15 more...)

arXiv.org Machine Learning

2502.0258

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (1.00)
Workflow (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

An Efficient Local Search Approach for Polarized Community Discovery in Signed Networks

Aronsson, Linus, Chehreghani, Morteza Haghir

arXiv.org Artificial IntelligenceFeb-4-2025

Signed networks, where edges are labeled as positive or negative to indicate friendly or antagonistic interactions, offer a natural framework for studying polarization, trust, and conflict in social systems. Detecting meaningful group structures in these networks is crucial for understanding online discourse, political division, and trust dynamics. A key challenge is to identify groups that are cohesive internally yet antagonistic externally, while allowing for neutral or unaligned vertices. In this paper, we address this problem by identifying $k$ polarized communities that are large, dense, and balanced in size. We develop an approach based on Frank-Wolfe optimization, leading to a local search procedure with provable convergence guarantees. Our method is both scalable and efficient, outperforming state-of-the-art baselines in solution quality while remaining competitive in terms of computational efficiency.

artificial intelligence, machine learning, optimization problem, (16 more...)

arXiv.org Artificial Intelligence

2502.02197

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Taiwan > Taiwan Province > Taipei (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)
(15 more...)

Genre: Research Report (0.82)

Industry:

Government (0.67)
Media > News (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

A New Rejection Sampling Approach to $k$-$\mathtt{means}$++ With Improved Trade-Offs

Shah, Poojan, Agrawal, Shashwat, Jaiswal, Ragesh

arXiv.org Artificial IntelligenceFeb-4-2025

The $k$-$\mathtt{means}$++ seeding algorithm (Arthur & Vassilvitskii, 2007) is widely used in practice for the $k$-means clustering problem where the goal is to cluster a dataset $\mathcal{X} \subset \mathbb{R} ^d$ into $k$ clusters. The popularity of this algorithm is due to its simplicity and provable guarantee of being $O(\log k)$ competitive with the optimal solution in expectation. However, its running time is $O(|\mathcal{X}|kd)$, making it expensive for large datasets. In this work, we present a simple and effective rejection sampling based approach for speeding up $k$-$\mathtt{means}$++. Our first method runs in time $\tilde{O}(\mathtt{nnz} (\mathcal{X}) + \beta k^2d)$ while still being $O(\log k )$ competitive in expectation. Here, $\beta$ is a parameter which is the ratio of the variance of the dataset to the optimal $k$-$\mathtt{means}$ cost in expectation and $\tilde{O}$ hides logarithmic factors in $k$ and $|\mathcal{X}|$. Our second method presents a new trade-off between computational cost and solution quality. It incurs an additional scale-invariant factor of $ k^{-\Omega( m/\beta)} \operatorname{Var} (\mathcal{X})$ in addition to the $O(\log k)$ guarantee of $k$-$\mathtt{means}$++ improving upon a result of (Bachem et al, 2016a) who get an additional factor of $m^{-1}\operatorname{Var}(\mathcal{X})$ while still running in time $\tilde{O}(\mathtt{nnz}(\mathcal{X}) + mk^2d)$. We perform extensive empirical evaluations to validate our theoretical results and to show the effectiveness of our approach on real datasets.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2502.02085

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > Germany (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Add feedback

Mask-informed Deep Contrastive Incomplete Multi-view Clustering

Li, Zhenglai, Shi, Yuqi, He, Xiao, Tang, Chang

arXiv.org Artificial IntelligenceFeb-4-2025

Multi-view clustering (MvC) utilizes information from multiple views to uncover the underlying structures of data. Despite significant advancements in MvC, mitigating the impact of missing samples in specific views on the integration of knowledge from different views remains a critical challenge. This paper proposes a novel Mask-informed Deep Contrastive Incomplete Multi-view Clustering (Mask-IMvC) method, which elegantly identifies a view-common representation for clustering. Specifically, we introduce a mask-informed fusion network that aggregates incomplete multi-view information while considering the observation status of samples across various views as a mask, thereby reducing the adverse effects of missing values. Additionally, we design a prior knowledge-assisted contrastive learning loss that boosts the representation capability of the aggregated view-common representation by injecting neighborhood information of samples from different views. Finally, extensive experiments are conducted to demonstrate the superiority of the proposed Mask-IMvC method over state-of-the-art approaches across multiple MvC datasets, both in complete and incomplete scenarios.

artificial intelligence, deep learning, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2502.02234

Country:

Asia > Macao (0.04)
Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry: Health & Medicine (0.48)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Beyond Win Rates: A Clustering-Based Approach to Character Balance Analysis in Team-Based Games

Zhou, Haokun

arXiv.org Artificial IntelligenceFeb-3-2025

Character diversity in competitive games, while enriching gameplay, often introduces balance challenges that can negatively impact player experience and strategic depth. Traditional balance assessments rely on aggregate metrics like win rates and pick rates, which offer limited insight into the intricate dynamics of team-based games and nuanced character roles. This paper proposes a novel clustering-based methodology to analyze character balance, leveraging in-game data from Valorant to account for team composition influences and reveal latent character roles. By applying hierarchical agglomerative clustering with Jensen-Shannon Divergence to professional match data from the Valorant Champions Tour 2022, our approach identifies distinct clusters of agents exhibiting similar co-occurrence patterns within team compositions. This method not only complements existing quantitative metrics but also provides a more holistic and interpretable perspective on character synergies and potential imbalances, offering game developers a valuable tool for informed and context-aware balance adjustments.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2502.0125

Country:

Europe > United Kingdom > England > Nottinghamshire > Nottingham (0.04)
Asia > China > Zhejiang Province > Ningbo (0.04)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Data Science > Data Mining (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Hierarchical Consensus Network for Multiview Feature Learning

Xia, Chengwei, Niu, Chaoxi, Zhan, Kun

arXiv.org Artificial IntelligenceFeb-3-2025

Multiview feature learning aims to learn discriminative features by integrating the distinct information in each view. However, most existing methods still face significant challenges in learning view-consistency features, which are crucial for effective multiview learning. Motivated by the theories of CCA and contrastive learning in multiview feature learning, we propose the hierarchical consensus network (HCN) in this paper. The HCN derives three consensus indices for capturing the hierarchical consensus across views, which are classifying consensus, coding consensus, and global consensus, respectively. Specifically, classifying consensus reinforces class-level correspondence between views from a CCA perspective, while coding consensus closely resembles contrastive learning and reflects contrastive comparison of individual instances. Global consensus aims to extract consensus information from two perspectives simultaneously. By enforcing the hierarchical consensus, the information within each view is better integrated to obtain more comprehensive and discriminative features. The extensive experimental results obtained on four multiview datasets demonstrate that the proposed method significantly outperforms several state-of-the-art methods.

artificial intelligence, consensus, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2502.01961

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Gansu Province > Lanzhou (0.04)
Asia > China > Qinghai Province (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Auditing a Dutch Public Sector Risk Profiling Algorithm Using an Unsupervised Bias Detection Tool

Holstege, Floris, Jorgensen, Mackenzie, Padh, Kirtan, Parie, Jurriaan, Persson, Joel, Prorokovic, Krsto, Snoek, Lukas

arXiv.org Artificial IntelligenceFeb-3-2025

Algorithms are increasingly used to automate or aid human decisions, yet recent research shows that these algorithms may exhibit bias across legally protected demographic groups. However, data on these groups may be unavailable to organizations or external auditors due to privacy legislation. This paper studies bias detection using an unsupervised clustering tool when data on demographic groups are unavailable. We collaborate with the Dutch Executive Agency for Education to audit an algorithm that was used to assign risk scores to college students at the national level in the Netherlands between 2012-2023. Our audit covers more than 250,000 students from the whole country. The unsupervised clustering tool highlights known disparities between students with a non-European migration background and Dutch origin. Our contributions are three-fold: (1) we assess bias in a real-world, large-scale and high-stakes decision-making process by a governmental organization; (2) we use simulation studies to highlight potential pitfalls of using the unsupervised clustering tool to detect true bias when demographic group data are unavailable and provide recommendations for valid inferences; (3) we provide the unsupervised clustering tool in an open-source library. Our work serves as a starting point for a deliberative assessment by human experts to evaluate potential discrimination in algorithmic-supported decision-making processes.

artificial intelligence, data mining, machine learning, (9 more...)

arXiv.org Artificial Intelligence

2502.01713

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
(4 more...)

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Regional Government (1.00)
Education > Educational Setting > Higher Education (1.00)
Law > Statutes (0.86)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback