AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

Unsupervised outlier detection to improve bird audio dataset labels

Collins, Bruce

arXiv.org Artificial Intelligence, Apr-29-2025

The Xeno-Canto bird audio repository is an invaluable resource for those interested in vocalizations and other sounds made by birds around the world. This is particularly the case for machine learning researchers attempting to improve on the bird species recognition accuracy of classification models. However, the task of extracting labeled datasets from the recordings found in this crowd-sourced repository faces several challenges. One challenge of particular significance to machine learning practitioners is that one bird species label is applied to each audio recording, but frequently other sounds are also captured, including other bird species, other animal sounds, anthropogenic and other ambient sounds. These non-target bird species sounds can result in dataset labeling discrepancies referred to as label noise. In this work we present a cleaning process consisting of audio preprocessing followed by dimensionality reduction and unsupervised outlier detection (UOD) to reduce the label noise in a dataset derived from Xeno-Canto recordings. We investigate three neural network dimensionality reduction techniques: two flavors of convolutional autoencoders and variational deep embedding (VaDE (Jiang, 2017)). While both methods show some degree of effectiveness at detecting outliers for most bird species datasets, we found significant variation in the performance of the methods from one species to the next. We believe that the results of this investigation demonstrate that the application of our cleaning process can meaningfully reduce the label noise of bird species datasets derived from the Xeno-Canto audio repository, but results vary across species.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2504.1865

Country: North America > United States > California (0.14)

Genre: Research Report > New Finding (0.93)

Industry:

Media > Music (0.48)
Leisure & Entertainment (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

Label-independent hyperparameter-free self-supervised single-view deep subspace clustering

Sindicic, Lovro, Kopriva, Ivica

arXiv.org Artificial Intelligence, Apr-28-2025

Deep subspace clustering (DSC) algorithms face several challenges that hinder their widespread adoption across various application domains. First, clustering quality is typically assessed using only the encoder's output layer, disregarding valuable information present in the intermediate layers. Second, most DSC approaches treat representation learning and subspace clustering as independent tasks, limiting their effectiveness. Third, they assume the availability of a held-out dataset for hyperparameter tuning, which is often impractical in real-world scenarios. Fourth, learning termination is commonly based on clustering error monitoring, requiring external labels. Finally, their performance often depends on post-processing techniques that rely on labeled data. To address these limitations, we introduce a novel single-view DSC approach that: (i) minimizes a layer-wise self-expression loss using a joint representation matrix; (ii) optimizes a subspace-structured norm to enhance clustering quality; (iii) employs a multi-stage sequential learning framework, consisting of pre-training and fine-tuning, enabling the use of multiple regularization terms without hyperparameter tuning; (iv) incorporates a relative error-based self-stopping mechanism to terminate training without labels; and (v) retains a fixed number of leading coefficients in the learned representation matrix based on prior knowledge. We evaluate the proposed method on six datasets representing faces, digits, and objects. The results show that our method outperforms most linear SC algorithms with carefully tuned hyperparameters while maintaining competitive performance with the best-performing linear approaches.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2504.18179

Country: Europe (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Random-Set Large Language Models

Mubashar, Muhammad, Manchingal, Shireen Kudukkil, Cuzzolin, Fabio

arXiv.org Artificial Intelligence, Apr-28-2025

Large Language Models (LLMs) are known to produce very high-quality text and responses to our queries. But how much can we trust this generated text? In this paper, we study the problem of uncertainty quantification in LLMs. We propose a novel Random-Set Large Language Model (RSLLM) approach which predicts finite random sets (belief functions) over the token space, rather than probability vectors as in classical LLMs. In order to do so efficiently, we also present a methodology based on hierarchical clustering to extract and use a budget of "focal" subsets of tokens upon which the belief prediction is defined, rather than using all possible collections of tokens, making the method scalable yet effective. RS-LLMs encode the epistemic uncertainty induced in their generation process by the size and diversity of their training set via the size of the credal sets associated with the predicted belief functions. The proposed approach is evaluated on the CoQA and OBQA datasets using Llama2-7b, Mistral-7b and Phi-2 models and is shown to outperform the standard model on both datasets in terms of correctness of answer, while also showing potential in estimating the second-level uncertainty in its predictions and providing the capability to detect when it is hallucinating.
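To make the notion of a belief function over "focal" subsets of tokens concrete, the sketch below computes belief and plausibility from a mass assignment using the textbook Dempster-Shafer definitions. It is not the RS-LLM architecture: the token names, the mass values, and the function names `belief` and `plausibility` are all hypothetical, chosen only for illustration.

```python
def belief(mass, query):
    """Sum the mass of every focal set fully contained in `query`."""
    query = frozenset(query)
    return sum(m for focal, m in mass.items() if focal <= query)


def plausibility(mass, query):
    """Sum the mass of every focal set that intersects `query`."""
    query = frozenset(query)
    return sum(m for focal, m in mass.items() if focal & query)


# Hypothetical mass assignment over three focal sets of tokens (sums to 1).
mass = {
    frozenset({"cat"}): 0.5,
    frozenset({"cat", "dog"}): 0.3,
    frozenset({"cat", "dog", "car"}): 0.2,
}

# The gap between plausibility and belief is one simple indicator of
# how much uncertainty remains about a candidate token.
gap_cat = plausibility(mass, {"cat"}) - belief(mass, {"cat"})
```

In this toy assignment the belief in "cat" counts only the mass committed to "cat" alone, while its plausibility also counts mass on larger sets that merely contain it, so a wide gap signals that much of the mass is uncommitted.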

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2504.18085

Country:

North America > United States (0.93)
Europe (0.67)

Genre: Research Report (0.64)

Industry: Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Graph-based Semi-supervised and Unsupervised Methods for Local Clustering

Shen, Zhaiming, Kang, Sung Ha

arXiv.org Machine Learning, Apr-27-2025

Local clustering aims to identify specific substructures within a large graph without requiring full knowledge of the entire graph. These substructures are typically small compared to the overall graph, enabling the problem to be approached by finding a sparse solution to a linear system associated with the graph Laplacian. In this work, we first propose a method for identifying specific local clusters when very few labeled data are given, which we term semi-supervised local clustering. We then extend this approach to the unsupervised setting when no prior information on labels is available. The proposed methods involve randomly sampling the graph, applying diffusion through local cluster extraction, then examining the overlap among the results to find each cluster. We establish the co-membership conditions for any pair of nodes and rigorously prove the correctness of our methods. Additionally, we conduct extensive experiments to demonstrate that the proposed methods achieve state-of-the-art results in the low-label-rate regime.

data mining, machine learning, node, (17 more...)

arXiv.org Machine Learning

2504.19419

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Data Science > Data Mining (0.94)

Statistical Inference for Clustering-based Anomaly Detection

Phu, Nguyen Thi Minh, Loc, Duong Tan, Duy, Vo Nguyen Le

arXiv.org Machine Learning, Apr-25-2025

Unsupervised anomaly detection (AD) is a fundamental problem in machine learning and statistics. A popular approach to unsupervised AD is clustering-based detection. However, this method lacks the ability to guarantee the reliability of the detected anomalies. In this paper, we propose SI-CLAD (Statistical Inference for CLustering-based Anomaly Detection), a novel statistical framework for testing the clustering-based AD results. The key strength of SI-CLAD lies in its ability to rigorously control the probability of falsely identifying anomalies, maintaining it below a pre-specified significance level $\alpha$ (e.g., $\alpha = 0.05$). By analyzing the selection mechanism inherent in clustering-based AD and leveraging the Selective Inference (SI) framework, we prove that false detection control is attainable. Moreover, we introduce a strategy to boost the true detection rate, enhancing the overall performance of SI-CLAD. Extensive experiments on synthetic and real-world datasets provide strong empirical support for our theoretical findings, showcasing the superior performance of the proposed method.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

2504.18633

Country:

North America > United States > Wisconsin (0.04)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
Asia > Middle East > UAE > Dubai Emirate > Dubai (0.04)
Asia > Japan (0.04)

Genre: Research Report > Experimental Study (0.34)

Industry: Health & Medicine > Therapeutic Area (0.95)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

CF-CAM: Cluster Filter Class Activation Mapping for Reliable Gradient-Based Interpretability

He, Hongjie, Pan, Xu, Yao, Yudong

arXiv.org Artificial Intelligence, Apr-24-2025

As deep learning continues to advance, the transparency of neural network decision-making remains a critical challenge, limiting trust and applicability in high-stakes domains. Class Activation Mapping (CAM) techniques have emerged as a key approach toward visualizing model decisions, yet existing methods face inherent trade-offs. Gradient-based CAM variants are sensitive to gradient noise, leading to unstable and unreliable explanations. Conversely, gradient-free approaches mitigate gradient instability but incur significant computational overhead and inference latency. To address these limitations, we propose a Cluster Filter Class Activation Map (CF-CAM) technique, a novel framework that reintroduces gradient-based weighting while enhancing robustness against gradient noise. CF-CAM utilizes a hierarchical importance weighting strategy to balance discriminative feature preservation and noise elimination. A density-aware channel clustering method via Density-Based Spatial Clustering of Applications with Noise (DBSCAN) groups semantically relevant feature channels and discards noise-prone activations. Additionally, cluster-conditioned gradient filtering leverages Gaussian filters to refine gradient signals, preserving edge-aware localization while suppressing noise impact. Experimental results demonstrate that CF-CAM achieves superior interpretability performance while enhancing computational efficiency, outperforming state-of-the-art CAM methods in faithfulness and robustness. By effectively mitigating gradient instability without excessive computational cost, CF-CAM provides a competitive solution for enhancing the interpretability of deep neural networks in critical applications such as autonomous driving and medical diagnosis.

artificial intelligence, cf-cam, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2504.0006

Country:

Asia > China > Guangdong Province > Shenzhen (0.05)
Asia > China > Anhui Province > Hefei (0.04)
Oceania > Australia > Western Australia > Perth (0.04)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)

Genre: Research Report > New Finding (0.66)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.94)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)

Signal Recovery from Random Dot-Product Graphs Under Local Differential Privacy

Vishwanath, Siddharth, Hehir, Jonathan

arXiv.org Machine Learning, Apr-24-2025

We consider the problem of recovering latent information from graphs under $\varepsilon$-edge local differential privacy where the presence of relationships/edges between two users/vertices remains confidential, even from the data curator. For the class of generalized random dot-product graphs, we show that a standard local differential privacy mechanism induces a specific geometric distortion in the latent positions. Leveraging this insight, we show that consistent recovery of the latent positions is achievable by appropriately adjusting the statistical inference procedure for the privatized graph. Furthermore, we prove that our procedure is nearly minimax-optimal under local edge differential privacy constraints. Lastly, we show that this framework allows for consistent recovery of geometric and topological information underlying the latent positions, as encoded in their persistence diagrams. Our results extend previous work from the private community detection literature to a substantially richer class of models and inferential tasks.
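The abstract does not name the privacy mechanism it analyzes. As an illustration only, the sketch below assumes symmetric randomized response on edge bits, a common way to obtain $\varepsilon$-edge local differential privacy, and shows why the reported graph is distorted in a predictable, invertible way: the expected report is an affine function of the true bit. The function names are hypothetical, and flipping each vertex pair once is a simplification of the local model, in which each user would report their own edges.

```python
import math
import random


def flip_probability(epsilon):
    """Probability of flipping a bit under symmetric randomized response."""
    return 1.0 / (1.0 + math.exp(epsilon))


def privatize_edges(adjacency, epsilon, rng):
    """Flip each off-diagonal pair independently; keep the result symmetric."""
    n = len(adjacency)
    q = flip_probability(epsilon)
    noisy = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bit = adjacency[i][j]
            if rng.random() < q:
                bit = 1 - bit
            noisy[i][j] = noisy[j][i] = bit
    return noisy


def expected_report(true_bit, epsilon):
    """E[reported bit] = q + (1 - 2q) * true_bit, an affine distortion."""
    q = flip_probability(epsilon)
    return q + (1.0 - 2.0 * q) * true_bit


def debias(reported, epsilon):
    """Invert the affine distortion; requires epsilon > 0 so that 1 - 2q != 0."""
    q = flip_probability(epsilon)
    return (reported - q) / (1.0 - 2.0 * q)


# Hypothetical 3-vertex path graph.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
noisy = privatize_edges(adjacency, epsilon=1.0, rng=random.Random(0))
```

Applying `debias` entrywise to the noisy adjacency matrix gives an unbiased estimate of the true adjacency matrix, at the cost of higher variance as `epsilon` shrinks.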

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

2504.17274

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(4 more...)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
(2 more...)

Mining Software Repositories for Expert Recommendation

Marshall, Chad, Barovic, Andrew, Moin, Armin

arXiv.org Artificial Intelligence, Apr-24-2025

We propose an automated approach to bug assignment to developers in large open-source software projects. This way, we assist human bug triagers who are in charge of finding the best developer with the right level of expertise in a particular area to be assigned to a newly reported issue. Our approach is based on the history of software development as documented in the issue tracking systems. Our approach works based on the bug reports' features, such as the corresponding products and components, as well as their priority and severity levels. We sort developers based on their experience with specific combinations of new reports. The evaluation is performed using Top-k accuracy, and the results are compared with the reported results in prior work, namely TopicMiner MTM, BUGZIE, Bug triaging via deep Reinforcement Learning (BT-RL), and LDA-SVM. The evaluation data come from various Eclipse and Mozilla projects, such as JDT, Firefox, and Thunderbird. Large open-source projects offer an issue tracking system or open bug repository, where developers and users can report the software defects they find or any new feature requests they may have. These reports are called bug reports or issues. In some cases, developers can volunteer to work on the reported issues they find interesting or relevant to their field of expertise. Additionally, they sometimes report issues and assign them to themselves. However, in many cases, particularly in large open-source projects, a group of developers, called bug triagers, decide who should process and fix a newly reported issue.
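Top-k accuracy, the metric named above, can be sketched as follows: a recommendation counts as a hit when the developer who actually fixed the bug appears among the first k ranked candidates. The developer names, the rankings, and the function name `top_k_accuracy` are hypothetical and are not taken from the paper's data.

```python
def top_k_accuracy(ranked_candidates, actual_fixers, k):
    """Fraction of reports whose actual fixer is among the first k candidates."""
    if len(ranked_candidates) != len(actual_fixers):
        raise ValueError("one ranking is required per bug report")
    if not ranked_candidates:
        return 0.0
    hits = sum(
        1
        for ranking, fixer in zip(ranked_candidates, actual_fixers)
        if fixer in ranking[:k]
    )
    return hits / len(ranked_candidates)


# Hypothetical developer rankings for three bug reports, best candidate first.
rankings = [
    ["alice", "bob", "carol"],
    ["bob", "carol", "alice"],
    ["carol", "alice", "bob"],
]
fixers = ["bob", "alice", "carol"]

top1 = top_k_accuracy(rankings, fixers, k=1)
top2 = top_k_accuracy(rankings, fixers, k=2)
```

Because larger k can only add hits, Top-k accuracy never decreases as k grows, which is why results are usually reported for several values of k.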

developer, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2504.16343

Country: North America > United States (0.93)

Genre: Research Report (1.00)

Technology:

Information Technology > Software (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)
(3 more...)

Visual Place Cell Encoding: A Computational Model for Spatial Representation and Cognitive Mapping

Hamilton, Chance J., Weitzenfeld, Alfredo

arXiv.org Artificial Intelligence, Apr-23-2025

This paper presents the Visual Place Cell Encoding (VPCE) model, a biologically inspired computational framework for simulating place cell-like activation using visual input. Drawing on evidence that visual landmarks play a central role in spatial encoding, the proposed VPCE model activates visual place cells by clustering high-dimensional appearance features extracted from images captured by a robot-mounted camera. Each cluster center defines a receptive field, and activation is computed based on visual similarity using a radial basis function. We evaluate whether the resulting activation patterns correlate with key properties of biological place cells, including spatial proximity, orientation alignment, and boundary differentiation. Experiments demonstrate that the VPCE can distinguish between visually similar yet spatially distinct locations and adapt to environment changes such as the insertion or removal of walls. These results suggest that structured visual input, even in the absence of motion cues or reward-driven learning, is sufficient to generate place-cell-like spatial representations and support biologically inspired cognitive mapping.
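The activation step described above, where each cluster center defines a receptive field and similarity is scored with a radial basis function, might look like the sketch below. It assumes a Gaussian RBF with a single shared width `sigma` and uses made-up two-dimensional features; the abstract specifies neither the kernel form nor the feature extractor, and the function names are hypothetical.

```python
import math


def rbf_activation(feature, center, sigma):
    """Gaussian RBF of the Euclidean distance between feature and center."""
    squared_distance = sum((f - c) ** 2 for f, c in zip(feature, center))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def place_cell_activations(feature, centers, sigma=1.0):
    """One activation per cluster center (one per simulated place cell)."""
    return [rbf_activation(feature, center, sigma) for center in centers]


# Hypothetical cluster centers in a 2-D appearance-feature space.
centers = [[0.0, 0.0], [3.0, 4.0]]
activations = place_cell_activations([0.0, 0.0], centers, sigma=5.0)
```

A feature that coincides with a center produces the maximum activation of 1, and activation decays smoothly with distance, so visually similar views excite the same cell more strongly than dissimilar ones.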

artificial intelligence, machine learning, spatial reasoning, (13 more...)

arXiv.org Artificial Intelligence

2504.15953

Country: North America > United States (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

On the Price of Differential Privacy for Hierarchical Clustering

Deng, Chengyuan, Gao, Jie, Upadhyay, Jalaj, Wang, Chen, Zhou, Samson

arXiv.org Artificial Intelligence, Apr-23-2025

Hierarchical clustering is a fundamental unsupervised machine learning task with the aim of organizing data into a hierarchy of clusters. Many applications of hierarchical clustering involve sensitive user information, therefore motivating recent studies on differentially private hierarchical clustering under the rigorous framework of Dasgupta's objective. However, it has been shown that any privacy-preserving algorithm under edge-level differential privacy necessarily suffers a large error. To capture practical applications of this problem, we focus on the weight privacy model, where each edge of the input graph is at least unit weight. We present a novel algorithm in the weight privacy model that shows significantly better approximation than known impossibility results in the edge-level DP setting. In particular, our algorithm achieves $O(\log^{1.5}n/\varepsilon)$ multiplicative error for $\varepsilon$-DP and runs in polynomial time, where $n$ is the size of the input graph, and the cost is never worse than the optimal additive error in existing work. We complement our algorithm by showing if the unit-weight constraint does not apply, the lower bound for weight-level DP hierarchical clustering is essentially the same as the edge-level DP, i.e. $\Omega(n^2/\varepsilon)$ additive error. As a result, we also obtain a new lower bound of $\tilde{\Omega}(1/\varepsilon)$ additive error for balanced sparsest cuts in the weight-level DP model, which may be of independent interest. Finally, we evaluate our algorithm on synthetic and real-world datasets. Our experimental results show that our algorithm performs well in terms of extra cost and has good scalability to large graphs.
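For readers unfamiliar with Dasgupta's objective, the sketch below evaluates it for a given tree: each edge contributes its weight times the number of leaves under the lowest common ancestor of its endpoints, so a good hierarchy separates heavy edges as late as possible. This is the non-private cost only, not the paper's algorithm; the nested-tuple tree encoding, the example graph, and the function name `dasgupta_cost` are assumptions made for illustration.

```python
def dasgupta_cost(tree, weights):
    """Sum over edges of weight * (number of leaves under the endpoints' LCA).

    `tree` is a nested tuple whose non-tuple entries are leaf labels.
    `weights` maps frozenset({u, v}) to the weight of edge (u, v).
    """

    def visit(node):
        if not isinstance(node, tuple):
            return 0.0, [node]
        cost = 0.0
        leaf_groups = []
        for child in node:
            child_cost, child_leaves = visit(child)
            cost += child_cost
            leaf_groups.append(child_leaves)
        size = sum(len(group) for group in leaf_groups)
        # Edges crossing between two children are split at this node.
        for a in range(len(leaf_groups)):
            for b in range(a + 1, len(leaf_groups)):
                for u in leaf_groups[a]:
                    for v in leaf_groups[b]:
                        cost += weights.get(frozenset((u, v)), 0.0) * size
        all_leaves = [leaf for group in leaf_groups for leaf in group]
        return cost, all_leaves

    return visit(tree)[0]


# Hypothetical graph: two tightly linked pairs joined by one light edge.
weights = {
    frozenset(("a", "b")): 3.0,
    frozenset(("c", "d")): 3.0,
    frozenset(("b", "c")): 1.0,
}
good_tree = (("a", "b"), ("c", "d"))
bad_tree = (("a", "c"), ("b", "d"))
good_cost = dasgupta_cost(good_tree, weights)
bad_cost = dasgupta_cost(bad_tree, weights)
```

The tree that keeps the heavy pairs together until the final split has the lower cost, which matches the intuition behind the objective.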

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2504.1558

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
