Goto

Collaborating Authors

 hdbscan


CoreSPECT: Enhancing Clustering Algorithms via an Interplay of Density and Geometry

Mukherjee, Chandra Sekhar, Bae, Joonyoung, Zhang, Jiapeng

arXiv.org Artificial Intelligence

In this paper, we provide a novel perspective on the underlying structure of real-world data with ground-truth clusters via characterization of an abundantly observed yet often overlooked density-geometry correlation, that manifests itself as a multi-layered manifold structure. We leverage this correlation to design CoreSPECT (Core Space Projection based Enhancement of Clustering Techniques), a general framework that improves the performance of generic clustering algorithms. Our framework boosts the performance of clustering algorithms by applying them to strategically selected regions, then extending the partial partition to a complete partition for the dataset using a novel neighborhood graph based multi-layer propagation procedure. We provide initial theoretical support of the functionality of our framework under the assumption of our model, and then provide large-scale real-world experiments on 19 datasets that include standard image datasets as well as genomics datasets. We observe two notable improvements. First, CoreSPECT improves the NMI of K-Means by 20% on average, making it competitive to (and in some cases surpassing) the state-of-the-art manifold-based clustering algorithms, while being orders of magnitude faster. Secondly, our framework boosts the NMI of HDBSCAN by more than 100% on average, making it competitive to the state-of-the-art in several cases without requiring the true number of clusters and hyper-parameter tuning. The overall ARI improvements are higher.


DelTriC: A Novel Clustering Method with Accurate Outlier

Javurek, Tomas, Gregor, Michal, Kula, Sebastian, Simko, Marian

arXiv.org Artificial Intelligence

The paper introduces DelTriC (Delaunay Triangulation Clustering), a clustering algorithm which integrates PCA/UMAP-based projection, Delaunay triangulation, and a novel back-projection mechanism to form clusters in the original high-dimensional space. DelTriC decouples neighborhood construction from decision-making by first triangulating in a low-dimensional proxy to index local adjacency, and then back-projecting to the original space to perform robust edge pruning, merging, and anomaly detection. DelTriC can outperform traditional methods such as k-means, DBSCAN, and HDBSCAN in many scenarios; it is both scalable and accurate, and it also significantly improves outlier detection.


Segmentation over Complexity: Evaluating Ensemble and Hybrid Approaches for Anomaly Detection in Industrial Time Series

Mastriani, Emilio, Costa, Alessandro, Incardona, Federico, Munari, Kevin, Spinello, Sebastiano

arXiv.org Artificial Intelligence

Abstract--In this study, we investigate the effectiveness of advanced feature engineering and hybrid model architectures for anomaly detection in a multivariate industrial time series, focusing on a steam turbine system. We evaluate the impact of change point-derived statistical features, clustering-based substructure representations, and hybrid learning strategies on detection performance. Despite their theoretical appeal, these complex approaches consistently underperformed compared to a simple Random Forest + XGBoost ensemble trained on segmented data. The ensemble achieved an AUC-ROC of 0.976, F1-score of 0.41, and 100% early detection within the defined time window. Our findings highlight that, in scenarios with highly imbalanced and temporally uncertain data, model simplicity combined with optimized segmentation can outperform more sophisticated architectures, offering greater robustness, interpretability, and operational utility. In recent years, anomaly detection in time series has become a critical challenge in industrial applications [1].


Investigating Thematic Patterns and User Preferences in LLM Interactions using BERTopic

Bhandarkar, Abhay, Mishra, Gaurav, Juchani, Khushi, Singhal, Harsh

arXiv.org Artificial Intelligence

This study applies BERTopic, a transformer-based topic modeling technique, to the lmsys-chat-1m dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). Each user prompt is paired with two anonymized LLM responses and a human preference label, used to assess user evaluation of competing model outputs. The main objective is uncovering thematic patterns in these conversations and examining their relation to user preferences, particularly if certain LLMs are consistently preferred within specific topics. A robust preprocessing pipeline was designed for multilingual variation, balancing dialogue turns, and cleaning noisy or redacted data. BERTopic extracted over 29 coherent topics including artificial intelligence, programming, ethics, and cloud infrastructure. We analysed relationships between topics and model preferences to identify trends in model-topic alignment. Visualization techniques included inter-topic distance maps, topic probability distributions, and model-versus-topic matrices. Our findings inform domain-specific fine-tuning and optimization strategies for improving real-world LLM performance and user satisfaction.


Self-Supervised and Topological Signal-Quality Assessment for Any PPG Device

Shao, Wei, Zhang, Ruoyu, Liang, Zequan, Kourkchi, Ehsan, Rafatirad, Setareh, Homayoun, Houman

arXiv.org Artificial Intelligence

Wearable photoplethysmography (PPG) is embedded in billions of devices, yet its optical waveform is easily corrupted by motion, perfusion loss, and ambient light, jeopardizing downstream cardiometric analytics. Existing signal-quality assessment (SQA) methods rely either on brittle heuristics or on data-hungry supervised models. We introduce the first fully unsupervised SQA pipeline for wrist PPG. Stage 1 trains a contrastive 1-D ResNet-18 on 276 h of raw, unlabeled data from heterogeneous sources (varying in device and sampling frequency), yielding optical-emitter- and motion-invariant embeddings (i.e., the learned representation is stable across differences in LED wavelength, drive intensity, and device optics, as well as wrist motion). Stage 2 converts each 512-D encoder embedding into a 4-D topological signature via persistent homology (PH) and clusters these signatures with HDBSCAN. To produce a binary signal-quality index (SQI), the acceptable PPG signals are represented by the densest cluster while the remaining clusters are assumed to mainly contain poor-quality PPG signals. Without re-tuning, the SQI attains Silhouette, Davies-Bouldin, and Calinski-Harabasz scores of 0.72, 0.34, and 6173, respectively, on a stratified sample of 10,000 windows. In this study, we propose a hybrid self-supervised-learning--topological-data-analysis (SSL--TDA) framework that offers a drop-in, scalable, cross-device quality gate for PPG signals.


What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets

Wilkinson, Meghan, Thomson, Robert H

arXiv.org Artificial Intelligence

Supervised machine learning techniques rely on labeled data to achieve high task performance, but this requires the labels to capture some meaningful differences in the underlying data structure. For training network intrusion detection algorithms, most datasets contain a series of attack classes and a single large benign class which captures all non-attack network traffic. A review of intrusion detection papers and guides that explicitly state their data preprocessing steps identified that the majority took the labeled categories of the dataset at face value when training their algorithms. The present paper evaluates the structure of benign traffic in several common intrusion detection datasets (NSL-KDD, UNSW-NB15, and CIC-IDS 2017) and determines whether there are meaningful sub-categories within this traffic which may improve overall multi-classification performance using common machine learning techniques. We present an overview of some unsupervised clustering techniques (e.g., HDBSCAN, Mean Shift Clustering) and show how they differentially cluster the benign traffic space.


Holistic Evaluations of Topic Models

Compton, Thomas

arXiv.org Artificial Intelligence

Topic models are gaining increasing commercial and academic interest for their ability to summarize large volumes of unstructured text. As unsupervised machine learning methods, they enable researchers to explore data and help general users understand key themes in large text collections. However, they risk becoming a 'black box', where users input data and accept the output as an accurate summary without scrutiny. This article evaluates topic models from a database perspective, drawing insights from 1140 BERTopic model runs. The goal is to identify trade-offs in optimizing model parameters and to reflect on what these findings mean for the interpretation and responsible use of topic models


Meteoroid stream identification with HDBSCAN unsupervised clustering algorithm

Peña-Asensio, Eloy, Ferrari, Fabio

arXiv.org Artificial Intelligence

Accurate identification of meteoroid streams is central to understanding their origins and evolution. However, overlapping clusters and background noise hinder classification, an issue amplified for missions such as ESA's LUMIO that rely on meteor shower observations to infer lunar meteoroid impact parameters. This study evaluates the performance of the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm for unsupervised meteoroid stream identification, comparing its outcomes with the established Cameras for All-Sky Meteor Surveillance (CAMS) look-up table method. We analyze the CAMS Meteoroid Orbit Database v3.0 using three feature vectors: LUTAB (CAMS geocentric parameters), ORBIT (heliocentric orbital elements), and GEO (adapted geocentric parameters). HDBSCAN is applied with varying minimum cluster sizes and two cluster selection methods (eom and leaf). To align HDBSCAN clusters with CAMS classifications, the Hungarian algorithm determines the optimal mapping. Clustering performance is assessed via the Silhouette score, Normalized Mutual Information, and F1 score, with Principal Component Analysis further supporting the analysis. With the GEO vector, HDBSCAN confirms 39 meteoroid streams, 21 strongly aligning with CAMS. The ORBIT vector identifies 30 streams, 13 with high matching scores. Less active showers pose identification challenges. The eom method consistently yields superior performance and agreement with CAMS. Although HDBSCAN requires careful selection of the minimum cluster size, it delivers robust, internally consistent clusters and outperforms the look-up table method in statistical coherence. These results underscore HDBSCAN's potential as a mathematically consistent alternative for meteoroid stream identification, although further validation is needed to assess physical validity.


A Dynamic Framework for Semantic Grouping of Common Data Elements (CDE) Using Embeddings and Clustering

Krishnamurthy, Madan, Korn, Daniel, Haendel, Melissa A, Mungall, Christopher J, Thessen, Anne E

arXiv.org Artificial Intelligence

This research aims to develop a dynamic and scalable framework to facilitate harmonization of Common Data Elements (CDEs) across heterogeneous biomedical datasets by addressing challenges such as semantic heterogeneity, structural variability, and context dependence to streamline integration, enhance interoperability, and accelerate scientific discovery. Our methodology leverages Large Language Models (LLMs) for context-aware text embeddings that convert CDEs into dense vectors capturing semantic relationships and patterns. These embeddings are clustered using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to group semantically similar CDEs. The framework incorporates four key steps: (1) LLM-based text embedding to mathematically represent semantic context, (2) unsupervised clustering of embeddings via HDBSCAN, (3) automated labeling using LLM summarization, and (4) supervised learning to train a classifier assigning new or unclustered CDEs to labeled clusters. Evaluated on the NIH NLM CDE Repository with over 24,000 CDEs, the system identified 118 meaningful clusters at an optimized minimum cluster size of 20. The classifier achieved 90.46 percent overall accuracy, performing best in larger categories. External validation against Gravity Projects Social Determinants of Health domains showed strong agreement (Adjusted Rand Index 0.52, Normalized Mutual Information 0.78), indicating that embeddings effectively capture cluster characteristics. This adaptable and scalable approach offers a practical solution to CDE harmonization, improving selection efficiency and supporting ongoing data interoperability.


Act Natural! Extending Naturalistic Projection to Multimodal Behavior Scenarios

Khan, Hamzah I., Fridovich-Keil, David

arXiv.org Artificial Intelligence

--Autonomous agents operating in public spaces must consider how their behaviors might affect the humans around them, even when not directly interacting with them. T o this end, it is often beneficial to be predictable and appear naturalistic. Existing methods for this purpose use human actor intent modeling or imitation learning techniques, but these approaches rarely capture all possible motivations for human behavior and/or require significant amounts of data. Our work extends a technique for modeling unimodal naturalistic behaviors with an explicit convex set representation, to account for multimodal behavior by using multiple convex sets. This more flexible representation provides a higher degree of fidelity in data-driven modeling of naturalistic behavior that arises in real-world scenarios in which human behavior is, in some sense, discrete, e.g. Equipped with this new set representation, we develop an optimization-based filter to project arbitrary trajectories into the set so that they appear naturalistic to humans in the scene, while also satisfying vehicle dynamics, actuator limits, etc. We demonstrate our methods on real-world human driving data from the inD (intersection) and rounD (roundabout) datasets. Safe and comfortable interaction between humans and autonomous agents requires a measure of predictable and naturalistic behavior from autonomous systems. Autonomous agents in the real world can easily find themselves in unsafe situations when they violate these informal norms of humanlike behavior. As one example, autonomous vehicles are well-documented to behave more cautiously than human drivers expect, which can lead to human drivers reacting unsafely to unexpected or abnormal driving and causing collisions [1]. Thus, autonomous agents must be able to plan and execute naturalistic, human-like behavior. However, naturalistic behavior tends to be challenging to model mathematically because human preferences and decision-making are opaque. Nevertheless, there exists a need for techniques which are able to model the wide variety of naturalistic behavior based on observations of human actions. This work was supported by the National Science Foundation under Grants 2211548 and 2336840. Each nonconvex set is represented by the union of a set of convex hulls which are formed by clustering the trajectory states at each time. Then, we project arbitrary trajectories into this set to make the behaviors more naturalistic.