Clustering
Diabetes subtypes classification for personalized health care: A review - Artificial Intelligence Review
Healthcare is evolving from standard to personalized, driven by patients' needs. Personalized healthcare is a medical model based on genetics, genomics, and other biological information that helps to predict disease risk. Machine learning and data mining are among the fastest-growing fields in healthcare, where they are used to classify patient cohorts from large datasets, and their application to diabetes subtyping will be a breakthrough. In this review paper, we identify, analyze, and summarize how previous studies distinguished diabetes subtypes and how they implemented subtyping using data mining and various clustering algorithms. We found that many studies suggest diabetes can be differentiated into subtypes based on the risk of complications, genetic definition, clinical features, and treatment selection.
Chronological Self-Training for Real-Time Speaker Diarization
Padfield, Dirk, Liebling, Daniel J.
Diarization partitions an audio stream into segments based on the voices of the speakers. Real-time diarization systems that include an enrollment step should limit enrollment training samples to reduce user interaction time. Although training on a small number of samples yields poor performance, we show that the accuracy can be improved dramatically using a chronological self-training approach. We studied the tradeoff between training time and classification performance and found that 1 second is sufficient to reach over 95% accuracy. We evaluated on 700 audio conversation files of about 10 minutes each from 6 different languages and demonstrated average diarization error rates as low as 10%.
Improving Fuzzy-Logic based Map-Matching Method with Trajectory Stay-Point Detection
Jafarlou, Minoo, Ebadati E., Omid Mahdi, Naderi, Hassan
The need to trace and process moving objects keeps growing, as numerous applications demand precise moving-object locations. Map-matching is employed as a preprocessing technique that matches a moving object's point to the corresponding road. However, most GPS trajectory datasets include stay-point irregularities, which cause map-matching algorithms to mismatch trajectories to irrelevant streets. Determining the stay-point regions in GPS trajectory datasets therefore leads to more accurate and faster matching. In this work, we cluster stay-points in a trajectory dataset with DBSCAN and eliminate redundant data to improve the efficiency of the map-matching algorithm by lowering processing time. We evaluated the performance and accuracy of our proposed method on a ground truth dataset and compared it with a fuzzy-logic based map-matching algorithm. Our approach yields a 27.39% data size reduction and an 8.9% processing time reduction with the same accuracy as the previous fuzzy-logic based map-matching approach.
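The stay-point clustering step can be illustrated with a small DBSCAN sketch. This is not the authors' implementation: the function names `dbscan` and `drop_redundant_stay_points`, the parameters `eps` and `min_pts`, and the toy coordinates are assumptions made for illustration. Plain Euclidean distance stands in for a geographic distance, and keeping one representative point per cluster is only one possible reading of "eliminate redundant data".

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point: keep expanding
        cluster += 1
    return labels

def drop_redundant_stay_points(points, eps, min_pts):
    """Keep noise (moving) points and only the first point of each cluster."""
    labels = dbscan(points, eps, min_pts)
    seen, kept = set(), []
    for p, lab in zip(points, labels):
        if lab == NOISE:
            kept.append(p)
        elif lab not in seen:
            seen.add(lab)
            kept.append(p)
    return kept

# Hypothetical trajectory: the object moves, lingers near (30, 0), then moves on.
trajectory = [(0, 0), (10, 0), (20, 0),
              (30, 0), (30.5, 0.2), (29.8, -0.1), (30.2, 0.3),
              (40, 0), (50, 0)]
labels = dbscan(trajectory, eps=1.0, min_pts=3)
reduced = drop_redundant_stay_points(trajectory, eps=1.0, min_pts=3)
```

In this toy trajectory the four points near (30, 0) form one cluster and collapse to a single representative, so fewer points are passed on to map-matching.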
An Introduction to Graph Partitioning Algorithms and Community Detection
Graph partitioning is a long-standing problem with a wide range of applications. This post shares the methodology for graph partitioning, with both theoretical explanations and practical implementations of some popular graph partitioning algorithms in Python code. "Clustering" can be confusing under different contexts. In this article, clustering means node clustering, i.e. partitioning the graphs into clusters (or communities). We use graph partitioning, (node) clustering, and community detection interchangeably. In other words, we do not consider overlapping communities anywhere in this article.
Are Cluster Validity Measures (In)valid?
Gagolewski, Marek, Bartoszuk, Maciej, Cena, Anna
Internal cluster validity measures (such as the Calinski-Harabasz, Dunn, or Davies-Bouldin indices) are frequently used for selecting the appropriate number of partitions a dataset should be split into. In this paper we consider what happens if we treat such indices as objective functions in unsupervised learning activities. Is the optimal grouping with regard to, say, the Silhouette index really meaningful? It turns out that many cluster (in)validity indices promote clusterings that match expert knowledge quite poorly. We also introduce a new, well-performing variant of the Dunn index that is built upon OWA operators and the near-neighbour graph so that subspaces of higher density, regardless of their shapes, can be separated from each other better.
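As a concrete reference point, the classic Dunn index can be sketched as the smallest between-cluster distance divided by the largest within-cluster diameter. The sketch below covers only this textbook form, not the OWA-based variant introduced in the paper. The function name `dunn_index` and the toy points are assumptions.

```python
from itertools import combinations
from math import dist

def dunn_index(points, labels):
    """Classic Dunn index: min between-cluster distance / max cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())

    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (dist(a, b) for g in groups for a, b in combinations(g, 2)),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    min_separation = min(
        dist(a, b)
        for g1, g2 in combinations(groups, 2)
        for a in g1
        for b in g2
    )
    if max_diameter == 0.0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter

# Two tight pairs, five units apart.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = dunn_index(points, [0, 0, 1, 1])  # groups the tight pairs together
bad = dunn_index(points, [0, 1, 0, 1])   # splits each tight pair
```

Higher values indicate compact, well-separated clusters, so the first labelling scores higher than the second.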
Enabling scalable clinical interpretation of ML-based phenotypes using real world data
Parsons, Owen, Barlow, Nathan E, Baxter, Janie, Paraschin, Karen, Derix, Andrea, Hein, Peter, Dürichen, Robert
The availability of large and deep electronic healthcare records (EHR) datasets has the potential to enable a better understanding of real-world patient journeys, and to identify novel subgroups of patients. ML-based aggregation of EHR data is mostly tool-driven, i.e., building on available or newly developed methods. However, these methods, their input requirements, and, importantly, resulting output are frequently difficult to interpret, especially without in-depth data science or statistical training. This endangers the final step of analysis where an actionable and clinically meaningful interpretation is needed. This study investigates approaches to perform patient stratification analysis at scale using large EHR datasets and multiple clustering methods for clinical research. We have developed several tools to facilitate the clinical evaluation and interpretation of unsupervised patient stratification results, namely pattern screening, meta clustering, surrogate modeling, and curation. These tools can be used at different stages within the analysis. As compared to a standard analysis approach, we demonstrate the ability to condense results and optimize analysis time. In the case of meta clustering, we demonstrate that the number of patient clusters can be reduced from 72 to 3 in one example. In another stratification result, by using surrogate models, we could quickly identify that heart failure patients were stratified if blood sodium measurements were available. As this is a routine measurement performed for all patients with heart failure, this indicated a data bias. By using further cohort and feature curation, these patients and other irrelevant features could be removed to increase the clinical meaningfulness. These examples show the effectiveness of the proposed methods and we hope to encourage further research in this field.
Maximal Independent Vertex Set applied to Graph Pooling
Stanovic, Stevan, Gaüzère, Benoit, Brun, Luc
Convolutional neural networks (CNN) have enabled major advances in image classification through convolution and pooling. In particular, image pooling transforms a connected discrete grid into a reduced grid with the same connectivity and allows reduction functions to take into account all the pixels of an image. However, a pooling satisfying such properties does not exist for graphs. Indeed, some methods are based on a vertex selection step, which causes a significant loss of information. Other methods learn a fuzzy clustering of vertex sets, which produces almost complete reduced graphs. We propose to overcome both problems using a new pooling method, named MIVSPool. This method is based on a selection of vertices, called surviving vertices, using a Maximal Independent Vertex Set (MIVS) and an assignment of the remaining vertices to the survivors. Consequently, our method neither discards any vertex information nor artificially increases the density of the graph. Experimental results show an increase in accuracy for graph classification on various standard datasets.
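The two ingredients named above, a maximal independent vertex set and an assignment of the remaining vertices to survivors, can be sketched on a small graph. This is a simplified illustration and not MIVSPool itself: the greedy visiting order, the smallest-id tie-break for assignment, and the function names are all assumptions, and the paper's own scoring and assignment rules are not reproduced here.

```python
def maximal_independent_set(adj, order=None):
    """Greedily pick vertices so that no two picked vertices are adjacent."""
    order = list(adj) if order is None else order
    survivors = set()
    for v in order:
        if not (adj[v] & survivors):  # no neighbor has been picked yet
            survivors.add(v)
    return survivors

def assign_to_survivors(adj, survivors):
    """Map every vertex to a survivor: itself, or an adjacent survivor."""
    assignment = {}
    for v in adj:
        if v in survivors:
            assignment[v] = v
        else:
            # Maximality guarantees at least one surviving neighbor exists.
            assignment[v] = min(adj[v] & survivors)
    return assignment

def pooled_edges(adj, assignment):
    """Connect two survivors when any original edge links their groups."""
    edges = set()
    for u in adj:
        for v in adj[u]:
            a, b = assignment[u], assignment[v]
            if a != b:
                edges.add((min(a, b), max(a, b)))
    return edges

# Path graph 0 - 1 - 2 - 3 - 4 as an adjacency dict of sets.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
survivors = maximal_independent_set(adj)
assignment = assign_to_survivors(adj, survivors)
reduced = pooled_edges(adj, assignment)
```

Every vertex ends up in exactly one survivor's group, so no vertex is dropped, and the reduced path keeps only the two edges between neighboring groups.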
No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling
Silva, Marília Costa Rosendo, Siqueira, Felipe Alves, Tarrega, João Pedro Mantovani, Beinotti, João Vitor Pataca, Nunes, Augusto Sousa, Gardini, Miguel de Mattos, da Silva, Vinícius Adolfo Pereira, da Silva, Nádia Félix Felipe, de Carvalho, André Carlos Ponce de Leon Ferreira
Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. The initialization can lead to variability depending on the machine learning algorithm. Furthermore, the distortions can be misleading when regarding cluster geometry. Amongst the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology since similar procedures have different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, the factorization, and the clustering algorithms that are directly or indirectly related to the reviewed works.
A Tighter Analysis of Spectral Clustering, and Beyond
This work studies the classical spectral clustering algorithm, which embeds the vertices of some graph $G=(V_G, E_G)$ into $\mathbb{R}^k$ using $k$ eigenvectors of some matrix of $G$, and applies $k$-means to partition $V_G$ into $k$ clusters. Our first result is a tighter analysis of the performance of spectral clustering, which explains why it works under a much weaker condition than those studied in the literature. For the second result, we show that, by using fewer than $k$ eigenvectors to construct the embedding, spectral clustering is able to produce better output for many practical instances; this result is the first of its kind in spectral clustering. Besides its conceptual and theoretical significance, the practical impact of our work is demonstrated by the empirical analysis on both synthetic and real-world datasets, in which spectral clustering produces comparable or better results with fewer than $k$ eigenvectors.
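The pipeline described above (eigenvector embedding followed by $k$-means) can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the paper's analysis or code: the normalized Laplacian is one common choice for "some matrix of $G$", the deterministic farthest-point initialisation is chosen only to make the example reproducible, and all function names are hypothetical. The `n_vecs` parameter is kept separate from `k` so that the embedding dimension can be varied.

```python
import numpy as np

def spectral_embedding(adjacency, n_vecs):
    """Embed vertices with the n_vecs bottom eigenvectors of the normalized Laplacian."""
    degree = adjacency.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)  # assumes no isolated vertices
    laplacian = (np.eye(len(adjacency))
                 - d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :])
    _, vecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    return vecs[:, :n_vecs]

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen center.
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq_dists, axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(adjacency, k, n_vecs=None):
    """Cluster vertices into k groups; n_vecs defaults to k eigenvectors."""
    n_vecs = k if n_vecs is None else n_vecs
    return kmeans(spectral_embedding(adjacency, n_vecs), k)

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge 2 - 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adjacency = np.zeros((6, 6))
for u, v in edges:
    adjacency[u, v] = adjacency[v, u] = 1.0
labels = spectral_clustering(adjacency, k=2)
```

On this toy graph the two triangles are recovered as the two clusters.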
Cluster Weighted Model Based on TSNE algorithm for High-Dimensional Data
As with many machine learning models, both the accuracy and speed of cluster weighted models (CWMs) can be hampered by high-dimensional data, which has led to previous work on parsimonious techniques to reduce the effect of the "curse of dimensionality" on mixture models. In this work, we review the background of CWMs. We further show that parsimonious techniques are not sufficient for mixture models to thrive in the presence of very large high-dimensional data. We discuss a heuristic for detecting the hidden components by choosing the initial values of location parameters using the default values in the "FlexCWM" R package. We introduce a dimensionality reduction technique called t-distributed stochastic neighbor embedding (TSNE) to enhance the parsimonious CWMs in high-dimensional space. CWMs were originally suited for regression; for classification purposes, all multi-class variables are transformed logarithmically with some noise. The parameters of the model are obtained via the expectation maximization algorithm. The effectiveness of the discussed technique is demonstrated using real data sets from different fields.