Clustering
A Novel Normalized-Cut Solver with Nearest Neighbor Hierarchical Initialization
Nie, Feiping, Lu, Jitao, Wu, Danyang, Wang, Rong, Li, Xuelong
Normalized-Cut (N-Cut) is a famous model of spectral clustering. The traditional N-Cut solvers are two-stage: 1) calculating the continuous spectral embedding of normalized Laplacian matrix; 2) discretization via $K$-means or spectral rotation. However, this paradigm brings two vital problems: 1) two-stage methods solve a relaxed version of the original problem, so they cannot obtain good solutions for the original N-Cut problem; 2) solving the relaxed problem requires eigenvalue decomposition, which has $\mathcal{O}(n^3)$ time complexity ($n$ is the number of nodes). To address the problems, we propose a novel N-Cut solver designed based on the famous coordinate descent method. Since the vanilla coordinate descent method also has $\mathcal{O}(n^3)$ time complexity, we design various accelerating strategies to reduce the time complexity to $\mathcal{O}(|E|)$ ($|E|$ is the number of edges). To avoid reliance on random initialization which brings uncertainties to clustering, we propose an efficient initialization method that gives deterministic outputs. Extensive experiments on several benchmark datasets demonstrate that the proposed solver can obtain larger objective values of N-Cut, meanwhile achieving better clustering performance compared to traditional solvers.
ConstraintMatch for Semi-constrained Clustering
Goschenhofer, Jann, Bischl, Bernd, Kira, Zsolt
Constrained clustering allows the training of classification models using pairwise constraints only, which are weak and relatively easy to mine, while still yielding full-supervision-level model performance. While they perform well even in the absence of the true underlying class labels, constrained clustering models still require large amounts of binary constraint annotations for training. In this paper, we propose a semi-supervised context whereby a large amount of \textit{unconstrained} data is available alongside a smaller set of constraints, and propose \textit{ConstraintMatch} to leverage such unconstrained data. While a great deal of progress has been made in semi-supervised learning using full labels, there are a number of challenges that prevent a naive application of the resulting methods in the constraint-based label setting. Therefore, we reason about and analyze these challenges, specifically 1) proposing a \textit{pseudo-constraining} mechanism to overcome the confirmation bias, a major weakness of pseudo-labeling, 2) developing new methods for pseudo-labeling towards the selection of \textit{informative} unconstrained samples, 3) showing that this also allows the use of pairwise loss functions for the initial and auxiliary losses which facilitates semi-constrained model training. In extensive experiments, we demonstrate the effectiveness of ConstraintMatch over relevant baselines in both the regular clustering and overclustering scenarios on five challenging benchmarks and provide analyses of its several components.
Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means
Basu, Supratik, Choudhury, Jyotishka Ray, Paul, Debolina, Das, Swagatam
Clustering stands as one of the most prominent challenges within the realm of unsupervised machine learning. Among the array of centroid-based clustering algorithms, the classic $k$-means algorithm, rooted in Lloyd's heuristic, takes center stage as one of the extensively employed techniques in the literature. Nonetheless, both $k$-means and its variants grapple with noteworthy limitations. These encompass a heavy reliance on initial cluster centroids, susceptibility to converging into local minima of the objective function, and sensitivity to outliers and noise in the data. When confronted with data containing noisy or outlier-laden observations, the Median-of-Means (MoM) estimator emerges as a stabilizing force for any centroid-based clustering framework. On a different note, a prevalent constraint among existing clustering methodologies resides in the prerequisite knowledge of the number of clusters prior to analysis. Utilizing model-based methodologies, such as Bayesian nonparametric models, offers the advantage of infinite mixture models, thereby circumventing the need for such requirements. Motivated by these facts, in this article, we present an efficient and automatic clustering technique by integrating the principles of model-based and centroid-based methodologies that mitigates the effect of noise on the quality of clustering while ensuring that the number of clusters need not be specified in advance. Statistical guarantees on the upper bound of clustering error, and rigorous assessment through simulated and real datasets suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
Speech-Based Blood Pressure Estimation with Enhanced Optimization and Incremental Clustering
Rajput, Vaishali, Mulay, Preeti, Raje, Rajeev
Commented [ZS1]: Add ORCIDs to all authors that have them. Abstract Blood Pressure (BP) estimation plays a pivotal role in diagnosing various health conditions, highlighting the need for innovative approaches to overcome conventional measurement challenges. Leveraging machine learning and speech signals, this study investigates accurate BP estimation with a focus on preprocessing, feature extraction, and real-time applications. An advanced clusteringbased strategy, incorporating the k-means algorithm and the proposed Fact-Finding Instructor optimization algorithm, is introduced to enhance accuracy. The combined outcome of these clustering techniques enables robust BP estimation. Moreover, extending beyond these insights, this study delves into the dynamic realm of contemporary digital content consumption. Platforms like YouTube have emerged as influential spaces, presenting an array of videos that evoke diverse emotions. Within this context, this research investigates the interplay between YouTube videos and physiological responses, particularly Blood Pressure (BP) levels. By integrating advanced BP estimation techniques with the emotional dimensions of YouTube videos, this study enriches our understanding of how modern media environments intersect with health implications. Performance evaluation through metrics including Davies Bouldin score, Homogeneity, completeness, Jacquard similarity, Silhouette score, and Dunn's index demonstrates substantial enhancements, particularly with a 90% training percentage. This method offers promising potential for accurate BP estimation, contributing to the evolution of assessment methodologies and ultimately enhancing healthcare outcomes. Introduction: According to Kaur et al. (2019) a human disease is a particular aberrant state that has a detrimental effect on an organism's overall structure or function but is not instantly caused by an external injury. According to Gautam et al.(2019) there are four main groups of diseases that affect humans: infectious diseases, deficiency disorders, hereditary diseases, and physiological diseases.
A Hybrid SOM and K-means Model for Time Series Energy Consumption Clustering
Energy consumption analysis plays a pivotal role in addressing the challenges of sustainability and resource management. This paper introduces a novel approach to effectively cluster monthly energy consumption patterns by integrating two powerful techniques: Self-organizing maps and K-means clustering. The proposed method aims to exploit the benefits of both of these algorithms to enhance the accuracy and interpretability of clustering results for a dataset in which finding patterns is difficult. The main focus of this study is on a selection of time series energy consumption data from the Smart meters in London dataset. The data was preprocessed and reduced in dimensionality to capture essential temporal patterns while retaining their underlying structures. The SOM algorithm was utilized to extract the central representatives of the consumption patterns for each one of the houses over the course of each month, effectively reducing the dimensionality of the dataset and making it easier for analysis. Subsequently, the obtained SOM centroids were clustered using K-means, a popular centroid-based clustering technique. The experimental results demonstrated a significant silhouette score of 66%, indicating strong intra-cluster cohesion and inter-cluster separation which confirms the effectiveness of the proposed approach in the clustering task.
Inferring Actual Treatment Pathways from Patient Records
Wilkins-Caruana, Adrian, Bandara, Madhushi, Musial, Katarzyna, Catchpoole, Daniel, Kennedy, Paul J.
Treatment pathways are step-by-step plans outlining the recommended medical care for specific diseases; they get revised when different treatments are found to improve patient outcomes. Examining health records is an important part of this revision process, but inferring patients' actual treatments from health data is challenging due to complex event-coding schemes and the absence of pathway-related annotations. This study aims to infer the actual treatment steps for a particular patient group from administrative health records (AHR) - a common form of tabular healthcare data - and address several technique- and methodology-based gaps in treatment pathway-inference research. We introduce Defrag, a method for examining AHRs to infer the real-world treatment steps for a particular patient group. Defrag learns the semantic and temporal meaning of healthcare event sequences, allowing it to reliably infer treatment steps from complex healthcare data. To our knowledge, Defrag is the first pathway-inference method to utilise a neural network (NN), an approach made possible by a novel, self-supervised learning objective. We also developed a testing and validation framework for pathway inference, which we use to characterise and evaluate Defrag's pathway inference ability and compare against baselines. We demonstrate Defrag's effectiveness by identifying best-practice pathway fragments for breast cancer, lung cancer, and melanoma in public healthcare records. Additionally, we use synthetic data experiments to demonstrate the characteristics of the Defrag method, and to compare Defrag to several baselines where it significantly outperforms non-NN-based methods. Defrag significantly outperforms several existing pathway-inference methods and offers an innovative and effective approach for inferring treatment pathways from AHRs. Open-source code is provided to encourage further research in this area.
Uncertainty in GNN Learning Evaluations: The Importance of a Consistent Benchmark for Community Detection
Leeney, William, McConville, Ryan
Graph Neural Networks (GNNs) have improved unsupervised community detection of clustered nodes due to their ability to encode the dual dimensionality of the connectivity and feature information spaces of graphs. Identifying the latent communities has many practical applications from social networks to genomics. Current benchmarks of real world performance are confusing due to the variety of decisions influencing the evaluation of GNNs at this task. To address this, we propose a framework to establish a common evaluation protocol. We motivate and justify it by demonstrating the differences with and without the protocol. The W Randomness Coefficient is a metric proposed for assessing the consistency of algorithm rankings to quantify the reliability of results under the presence of randomness. We find that by ensuring the same evaluation criteria is followed, there may be significant differences from the reported performance of methods at this task, but a more complete evaluation and comparison of methods is possible.
$k$-Means Clustering for Persistent Homology
Cao, Yueqi, Leung, Prudence, Monod, Anthea
Persistent homology is a methodology central to topological data analysis that extracts and summarizes the topological features within a dataset as a persistence diagram; it has recently gained much popularity from its myriad successful applications to many domains. However, its algebraic construction induces a metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence of the $k$-means clustering algorithm on persistence diagram space and establish theoretical properties of the solution to the optimization problem in the Karush--Kuhn--Tucker framework. Additionally, we perform numerical experiments on various representations of persistent homology, including embeddings of persistence diagrams as well as diagrams themselves and their generalizations as persistence measures; we find that $k$-means clustering performance directly on persistence diagrams and measures outperform their vectorized representations.
BHGNN-RT: Network embedding for directed heterogeneous graphs
Networks are one of the most valuable data structures for modeling problems in the real world. However, the most recent node embedding strategies have focused on undirected graphs, with limited attention to directed graphs, especially directed heterogeneous graphs. In this study, we first investigated the network properties of directed heterogeneous graphs. Based on network analysis, we proposed an embedding method, a bidirectional heterogeneous graph neural network with random teleport (BHGNN-RT), for directed heterogeneous graphs, that leverages bidirectional message-passing process and network heterogeneity. With the optimization of teleport proportion, BHGNN-RT is beneficial to overcome the over-smoothing problem. Extensive experiments on various datasets were conducted to verify the efficacy and efficiency of BHGNN-RT. Furthermore, we investigated the effects of message components, model layer, and teleport proportion on model performance. The performance comparison with all other baselines illustrates that BHGNN-RT achieves state-of-the-art performance, outperforming the benchmark methods in both node classification and unsupervised clustering tasks.
DIVA: A Dirichlet Process Mixtures Based Incremental Deep Clustering Algorithm via Variational Auto-Encoder
Bing, Zhenshan, Meng, Yuan, Yun, Yuqi, Su, Hang, Su, Xiaojie, Huang, Kai, Knoll, Alois
Generative model-based deep clustering frameworks excel in classifying complex data, but are limited in handling dynamic and complex features because they require prior knowledge of the number of clusters. In this paper, we propose a nonparametric deep clustering framework that employs an infinite mixture of Gaussians as a prior. Our framework utilizes a memoized online variational inference method that enables the "birth" and "merge" moves of clusters, allowing our framework to cluster data in a "dynamic-adaptive" manner, without requiring prior knowledge of the number of features. We name the framework as DIVA, a Dirichlet Process-based Incremental deep clustering framework via Variational Auto-Encoder. Our framework, which outperforms state-of-the-art baselines, exhibits superior performance in classifying complex data with dynamically changing features, particularly in the case of incremental features. We released our source code implementation at: https://github.com/Ghiara/diva