AITopics | hdbscan

Collaborating Authors

hdbscan

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees

Gagolewski, Marek

arXiv.org Machine LearningApr-9-2026

We introduce Lumbermark, a robust divisive clustering algorithm capable of detecting clusters of varying sizes, densities, and shapes. Lumbermark iteratively chops off large limbs connected by protruding segments of a dataset's mutual reachability minimum spanning tree. The use of mutual reachability distances smoothens the data distribution and decreases the influence of low-density objects, such as noise points between clusters or outliers at their peripheries. The algorithm can be viewed as an alternative to HDBSCAN that produces partitions with user-specified sizes. A fast, easy-to-use implementation of the new method is available in the open-source 'lumbermark' package for Python and R. We show that Lumbermark performs well on benchmark data and hope it will prove useful to data scientists and practitioners across different fields.

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Machine Learning

2604.07143

Country: Europe > Poland (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.90)

Add feedback

CoreSPECT: Enhancing Clustering Algorithms via an Interplay of Density and Geometry

Mukherjee, Chandra Sekhar, Bae, Joonyoung, Zhang, Jiapeng

arXiv.org Artificial IntelligenceDec-4-2025

In this paper, we provide a novel perspective on the underlying structure of real-world data with ground-truth clusters via characterization of an abundantly observed yet often overlooked density-geometry correlation, that manifests itself as a multi-layered manifold structure. We leverage this correlation to design CoreSPECT (Core Space Projection based Enhancement of Clustering Techniques), a general framework that improves the performance of generic clustering algorithms. Our framework boosts the performance of clustering algorithms by applying them to strategically selected regions, then extending the partial partition to a complete partition for the dataset using a novel neighborhood graph based multi-layer propagation procedure. We provide initial theoretical support of the functionality of our framework under the assumption of our model, and then provide large-scale real-world experiments on 19 datasets that include standard image datasets as well as genomics datasets. We observe two notable improvements. First, CoreSPECT improves the NMI of K-Means by 20% on average, making it competitive to (and in some cases surpassing) the state-of-the-art manifold-based clustering algorithms, while being orders of magnitude faster. Secondly, our framework boosts the NMI of HDBSCAN by more than 100% on average, making it competitive to the state-of-the-art in several cases without requiring the true number of clusters and hyper-parameter tuning. The overall ARI improvements are higher.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2507.08243

Genre:

Research Report (0.63)
Workflow (0.47)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.66)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

DelTriC: A Novel Clustering Method with Accurate Outlier

Javurek, Tomas, Gregor, Michal, Kula, Sebastian, Simko, Marian

arXiv.org Artificial IntelligenceNov-24-2025

The paper introduces DelTriC (Delaunay Triangulation Clustering), a clustering algorithm which integrates PCA/UMAP-based projection, Delaunay triangulation, and a novel back-projection mechanism to form clusters in the original high-dimensional space. DelTriC decouples neighborhood construction from decision-making by first triangulating in a low-dimensional proxy to index local adjacency, and then back-projecting to the original space to perform robust edge pruning, merging, and anomaly detection. DelTriC can outperform traditional methods such as k-means, DBSCAN, and HDBSCAN in many scenarios; it is both scalable and accurate, and it also significantly improves outlier detection.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2511.17219

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Segmentation over Complexity: Evaluating Ensemble and Hybrid Approaches for Anomaly Detection in Industrial Time Series

Mastriani, Emilio, Costa, Alessandro, Incardona, Federico, Munari, Kevin, Spinello, Sebastiano

arXiv.org Artificial IntelligenceOct-31-2025

Abstract--In this study, we investigate the effectiveness of advanced feature engineering and hybrid model architectures for anomaly detection in a multivariate industrial time series, focusing on a steam turbine system. We evaluate the impact of change point-derived statistical features, clustering-based substructure representations, and hybrid learning strategies on detection performance. Despite their theoretical appeal, these complex approaches consistently underperformed compared to a simple Random Forest + XGBoost ensemble trained on segmented data. The ensemble achieved an AUC-ROC of 0.976, F1-score of 0.41, and 100% early detection within the defined time window. Our findings highlight that, in scenarios with highly imbalanced and temporally uncertain data, model simplicity combined with optimized segmentation can outperform more sophisticated architectures, offering greater robustness, interpretability, and operational utility. In recent years, anomaly detection in time series has become a critical challenge in industrial applications [1].

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2510.26159

Genre: Research Report > New Finding (1.00)

Industry: Machinery > Industrial Machinery (0.34)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

Investigating Thematic Patterns and User Preferences in LLM Interactions using BERTopic

Bhandarkar, Abhay, Mishra, Gaurav, Juchani, Khushi, Singhal, Harsh

arXiv.org Artificial IntelligenceOct-10-2025

This study applies BERTopic, a transformer-based topic modeling technique, to the lmsys-chat-1m dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). Each user prompt is paired with two anonymized LLM responses and a human preference label, used to assess user evaluation of competing model outputs. The main objective is uncovering thematic patterns in these conversations and examining their relation to user preferences, particularly if certain LLMs are consistently preferred within specific topics. A robust preprocessing pipeline was designed for multilingual variation, balancing dialogue turns, and cleaning noisy or redacted data. BERTopic extracted over 29 coherent topics including artificial intelligence, programming, ethics, and cloud infrastructure. We analysed relationships between topics and model preferences to identify trends in model-topic alignment. Visualization techniques included inter-topic distance maps, topic probability distributions, and model-versus-topic matrices. Our findings inform domain-specific fine-tuning and optimization strategies for improving real-world LLM performance and user satisfaction.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.07557

Country: Asia > India (0.14)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Add feedback

Self-Supervised and Topological Signal-Quality Assessment for Any PPG Device

Shao, Wei, Zhang, Ruoyu, Liang, Zequan, Kourkchi, Ehsan, Rafatirad, Setareh, Homayoun, Houman

arXiv.org Artificial IntelligenceSep-18-2025

Wearable photoplethysmography (PPG) is embedded in billions of devices, yet its optical waveform is easily corrupted by motion, perfusion loss, and ambient light, jeopardizing downstream cardiometric analytics. Existing signal-quality assessment (SQA) methods rely either on brittle heuristics or on data-hungry supervised models. We introduce the first fully unsupervised SQA pipeline for wrist PPG. Stage 1 trains a contrastive 1-D ResNet-18 on 276 h of raw, unlabeled data from heterogeneous sources (varying in device and sampling frequency), yielding optical-emitter- and motion-invariant embeddings (i.e., the learned representation is stable across differences in LED wavelength, drive intensity, and device optics, as well as wrist motion). Stage 2 converts each 512-D encoder embedding into a 4-D topological signature via persistent homology (PH) and clusters these signatures with HDBSCAN. To produce a binary signal-quality index (SQI), the acceptable PPG signals are represented by the densest cluster while the remaining clusters are assumed to mainly contain poor-quality PPG signals. Without re-tuning, the SQI attains Silhouette, Davies-Bouldin, and Calinski-Harabasz scores of 0.72, 0.34, and 6173, respectively, on a stratified sample of 10,000 windows. In this study, we propose a hybrid self-supervised-learning--topological-data-analysis (SSL--TDA) framework that offers a drop-in, scalable, cross-device quality gate for PPG signals.

artificial intelligence, machine learning, pipeline, (13 more...)

arXiv.org Artificial Intelligence

2509.1251

Country: North America > United States > California > Yolo County > Davis (0.15)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.31)

Add feedback

What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets

Wilkinson, Meghan, Thomson, Robert H

arXiv.org Artificial IntelligenceSep-12-2025

Supervised machine learning techniques rely on labeled data to achieve high task performance, but this requires the labels to capture some meaningful differences in the underlying data structure. For training network intrusion detection algorithms, most datasets contain a series of attack classes and a single large benign class which captures all non-attack network traffic. A review of intrusion detection papers and guides that explicitly state their data preprocessing steps identified that the majority took the labeled categories of the dataset at face value when training their algorithms. The present paper evaluates the structure of benign traffic in several common intrusion detection datasets (NSL-KDD, UNSW-NB15, and CIC-IDS 2017) and determines whether there are meaningful sub-categories within this traffic which may improve overall multi-classification performance using common machine learning techniques. We present an overview of some unsupervised clustering techniques (e.g., HDBSCAN, Mean Shift Clustering) and show how they differentially cluster the benign traffic space.

artificial intelligence, dataset, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2509.09564

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Holistic Evaluations of Topic Models

Compton, Thomas

arXiv.org Artificial IntelligenceAug-1-2025

Topic models are gaining increasing commercial and academic interest for their ability to summarize large volumes of unstructured text. As unsupervised machine learning methods, they enable researchers to explore data and help general users understand key themes in large text collections. However, they risk becoming a 'black box', where users input data and accept the output as an accurate summary without scrutiny. This article evaluates topic models from a database perspective, drawing insights from 1140 BERTopic model runs. The goal is to identify trade-offs in optimizing model parameters and to reflect on what these findings mean for the interpretation and responsible use of topic models

machine learning, natural language, topic model, (21 more...)

arXiv.org Artificial Intelligence

2507.23364

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)

Add feedback

Meteoroid stream identification with HDBSCAN unsupervised clustering algorithm

Peña-Asensio, Eloy, Ferrari, Fabio

arXiv.org Artificial IntelligenceJul-3-2025

Accurate identification of meteoroid streams is central to understanding their origins and evolution. However, overlapping clusters and background noise hinder classification, an issue amplified for missions such as ESA's LUMIO that rely on meteor shower observations to infer lunar meteoroid impact parameters. This study evaluates the performance of the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm for unsupervised meteoroid stream identification, comparing its outcomes with the established Cameras for All-Sky Meteor Surveillance (CAMS) look-up table method. We analyze the CAMS Meteoroid Orbit Database v3.0 using three feature vectors: LUTAB (CAMS geocentric parameters), ORBIT (heliocentric orbital elements), and GEO (adapted geocentric parameters). HDBSCAN is applied with varying minimum cluster sizes and two cluster selection methods (eom and leaf). To align HDBSCAN clusters with CAMS classifications, the Hungarian algorithm determines the optimal mapping. Clustering performance is assessed via the Silhouette score, Normalized Mutual Information, and F1 score, with Principal Component Analysis further supporting the analysis. With the GEO vector, HDBSCAN confirms 39 meteoroid streams, 21 strongly aligning with CAMS. The ORBIT vector identifies 30 streams, 13 with high matching scores. Less active showers pose identification challenges. The eom method consistently yields superior performance and agreement with CAMS. Although HDBSCAN requires careful selection of the minimum cluster size, it delivers robust, internally consistent clusters and outperforms the look-up table method in statistical coherence. These results underscore HDBSCAN's potential as a mathematically consistent alternative for meteoroid stream identification, although further validation is needed to assess physical validity.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2507.01501

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > Italy > Lombardy > Milan (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Government (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A Dynamic Framework for Semantic Grouping of Common Data Elements (CDE) Using Embeddings and Clustering

Krishnamurthy, Madan, Korn, Daniel, Haendel, Melissa A, Mungall, Christopher J, Thessen, Anne E

arXiv.org Artificial IntelligenceJun-4-2025

This research aims to develop a dynamic and scalable framework to facilitate harmonization of Common Data Elements (CDEs) across heterogeneous biomedical datasets by addressing challenges such as semantic heterogeneity, structural variability, and context dependence to streamline integration, enhance interoperability, and accelerate scientific discovery. Our methodology leverages Large Language Models (LLMs) for context-aware text embeddings that convert CDEs into dense vectors capturing semantic relationships and patterns. These embeddings are clustered using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to group semantically similar CDEs. The framework incorporates four key steps: (1) LLM-based text embedding to mathematically represent semantic context, (2) unsupervised clustering of embeddings via HDBSCAN, (3) automated labeling using LLM summarization, and (4) supervised learning to train a classifier assigning new or unclustered CDEs to labeled clusters. Evaluated on the NIH NLM CDE Repository with over 24,000 CDEs, the system identified 118 meaningful clusters at an optimized minimum cluster size of 20. The classifier achieved 90.46 percent overall accuracy, performing best in larger categories. External validation against Gravity Projects Social Determinants of Health domains showed strong agreement (Adjusted Rand Index 0.52, Normalized Mutual Information 0.78), indicating that embeddings effectively capture cluster characteristics. This adaptable and scalable approach offers a practical solution to CDE harmonization, improving selection efficiency and supporting ongoing data interoperability.

cde, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2506.0216

Country: North America > United States (1.00)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.68)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback