Clustering
Autoencoder Based Iterative Modeling and Multivariate Time-Series Subsequence Clustering Algorithm
Köhne, Jonas, Henning, Lars, Gühmann, Clemens
This paper introduces an algorithm for the detection of change-points and the identification of the corresponding subsequences in transient multivariate time-series data (MTSD). The analysis of such data has become more and more important due to the increase of availability in many industrial fields. Labeling, sorting or filtering highly transient measurement data for training condition based maintenance (CbM) models is cumbersome and error-prone. For some applications it can be sufficient to filter measurements by simple thresholds or finding change-points based on changes in mean value and variation. But a robust diagnosis of a component within a component group for example, which has a complex non-linear correlation between multiple sensor values, a simple approach would not be feasible. No meaningful and coherent measurement data which could be used for training a CbM model would emerge. Therefore, we introduce an algorithm which uses a recurrent neural network (RNN) based Autoencoder (AE) which is iteratively trained on incoming data. The scoring function uses the reconstruction error and latent space information. A model of the identified subsequence is saved and used for recognition of repeating subsequences as well as fast offline clustering. For evaluation, we propose a new similarity measure based on the curvature for a more intuitive time-series subsequence clustering metric. A comparison with seven other state-of-the-art algorithms and eight datasets shows the capability and the increased performance of our algorithm to cluster MTSD online and offline in conjunction with mechatronic systems.
Creating Compact Regions of Social Determinants of Health
Lattimer, Barrett, Lattimer, Alan
Regionalization is the act of breaking a dataset into contiguous homogeneous regions that are heterogeneous from each other. Many different algorithms exist for performing regionalization; however, using these algorithms on large real world data sets have only become feasible in terms of compute power in recent years. Very few studies have been done comparing different regionalization methods, and those that do lack analysis in memory, scalability, geographic metrics, and large-scale real-world applications. This study compares state-of-the-art regionalization methods, namely, Agglomerative Clustering, SKATER, REDCAP, AZP, and Max-P-Regions using real world social determinant of health (SDOH) data. The scale of real world SDOH data, up to 1 million data points in this study, not only compares the algorithms over different data sets but provides a stress test for each individual regionalization algorithm, most of which have never been run on such scales previously. We use several new geographic metrics to compare algorithms as well as perform a comparative memory analysis. The prevailing regionalization method is then compared with unconstrained K-Means clustering on their ability to separate real health data in Virginia and Washington DC.
Library transfer between distinct Laser-Induced Breakdown Spectroscopy systems with shared standards
Vrábel, J., Képeš, E., Nedělník, P., Buday, J., Cempírek, J., Pořízka, P., Kaiser, J.
The mutual incompatibility of distinct spectroscopic systems is among the most limiting factors in Laser-Induced Breakdown Spectroscopy (LIBS). The cost related to setting up a new LIBS system is increased, as its extensive calibration is required. Solving the problem would enable inter-laboratory reference measurements and shared spectral libraries, which are fundamental for other spectroscopic techniques. In this work, we study a simplified version of this challenge where LIBS systems differ only in used spectrometers and collection optics but share all other parts of the apparatus, and collect spectra simultaneously from the same plasma plume. Extensive datasets measured as hyperspectral images of heterogeneous specimens are used to train machine learning models that can transfer spectra between systems. The transfer is realized by a pipeline that consists of a variational autoencoder (VAE) and a fully-connected artificial neural network (ANN). In the first step, we obtain a latent representation of the spectra which were measured on the Primary system (by using the VAE). In the second step, we map spectra from the Secondary system to corresponding locations in the latent space (by the ANN). Finally, Secondary system spectra are reconstructed from the latent space to the space of the Primary system. The transfer is evaluated by several figures of merit (Euclidean and cosine distances, both spatially resolved; k-means clustering of transferred spectra). The methodology is compared to several baseline approaches.
Clustering Algorithms in Machine Learning
Machine Learning problems deal with a great deal of data and depend heavily on the algorithms that are used to train the model. There are various approaches and algorithms to train a machine learning model based on the problem at hand. Supervised and unsupervised learning are the two most prominent of these approaches. An important real-life problem of marketing a product or service to a specific target audience can be easily resolved with the help of a form of unsupervised learning known as Clustering. This article will explain clustering algorithms along with real-life problems and examples.
A Bibliographic View on Constrained Clustering
Kuncheva, Ludmila, Williams, Francis, Hennessey, Samuel
A keyword search on constrained clustering on Web-of-Science returned just under 3,000 documents. We ran automatic analyses of those, and compiled our own bibliography of 183 papers which we analysed in more detail based on their topic and experimental study, if any. This paper presents general trends of the area and its sub-topics by Pareto analysis, using citation count and year of publication. We list available software and analyse the experimental sections of our reference collection. We found a notable lack of large comparison experiments. Among the topics we reviewed, applications studies were most abundant recently, alongside deep learning, active learning and ensemble learning.
Out-of-Distribution Detection Without Class Labels
Cohen, Niv, Abutbul, Ron, Hoshen, Yedid
Out-of-distribution detection seeks to identify novelties, samples that deviate from the norm. The task has been found to be quite challenging, particularly in the case where the normal data distribution consists of multiple semantic classes (e.g., multiple object categories). To overcome this challenge, current approaches require manual labeling of the normal images provided during training. In this work, we tackle multi-class novelty detection without class labels. Our simple but effective solution consists of two stages: we first discover "pseudo-class" labels using unsupervised clustering. Then using these pseudo-class labels, we are able to use standard supervised out-of-distribution detection methods. We verify the performance of our method by a favorable comparison to the state-of-the-art, and provide extensive analysis and ablations.
Non-Negative Matrix Factorization with Scale Data Structure Preservation
Hedjam, Rachid, Abdesselam, Abdelhamid, Rahiche, Abderrahmane, Cheriet, Mohamed
Low-rank matrix factorization (MF) is a hot topic in many research problems such as feature extraction and dimensionality reduction Vidal et al. [2005], subspace segmentation Liu et al. [2010], data clustering Favaro et al. [2011], image processing and computer vision Peng et al. [2012] to mention a few. The key idea behind MF is that there is a latent data structure embedded in the high dimensional observed data which, once discovered, provides better capacity for learning. Formally, MF techniques aim to decompose an observed high-dimensional data matrix into its constitute lower-dimensional factorizing matrices (in general two). One of the factorizing matrices represents the lower-dimensional space and the other one represents the spread of latent data in that space. MF has been widely used as a unified technique for dimensionality reduction, clustering, and matrix completion. There are several variants of MF in the literature including basic MF (BMF), non-negative MF (NMF) and Orthogonal NMF (ONMF). BMF are those described using traditional matrix decomposition such as principal component analysis (PCA), vector quantization (VQ) and singular value decomposition (SVD).
High-order Multi-view Clustering for Generic Data
Graph-based multi-view clustering has achieved better performance than most non-graph approaches. However, in many real-world scenarios, the graph structure of data is not given or the quality of initial graph is poor. Additionally, existing methods largely neglect the high-order neighborhood information that characterizes complex intrinsic interactions. To tackle these problems, we introduce an approach called high-order multi-view clustering (HMvC) to explore the topology structure information of generic data. Firstly, graph filtering is applied to encode structure information, which unifies the processing of attributed graph data and non-graph data in a single framework. Secondly, up to infinity-order intrinsic relationships are exploited to enrich the learned graph. Thirdly, to explore the consistent and complementary information of various views, an adaptive graph fusion mechanism is proposed to achieve a consensus graph. Comprehensive experimental results on both non-graph and attributed graph data show the superior performance of our method with respect to various state-of-the-art techniques, including some deep learning methods.
Efficient Distribution Similarity Identification in Clustered Federated Learning via Principal Angles Between Client Data Subspaces
Vahidian, Saeed, Morafah, Mahdi, Wang, Weijia, Kungurtsev, Vyacheslav, Chen, Chen, Shah, Mubarak, Lin, Bill
Clustered federated learning (FL) has been shown to produce promising results by grouping clients into clusters. This is especially effective in scenarios where separate groups of clients have significant differences in the distributions of their local data. Existing clustered FL algorithms are essentially trying to group together clients with similar distributions so that clients in the same cluster can leverage each other's data to better perform federated learning. However, prior clustered FL algorithms attempt to learn these distribution similarities indirectly during training, which can be quite time consuming as many rounds of federated learning may be required until the formation of clusters is stabilized. In this paper, we propose a new approach to federated learning that directly aims to efficiently identify distribution similarities among clients by analyzing the principal angles between the client data subspaces. Each client applies a truncated singular value decomposition (SVD) step on its local data in a single-shot manner to derive a small set of principal vectors, which provides a signature that succinctly captures the main characteristics of the underlying distribution. This small set of principal vectors is provided to the server so that the server can directly identify distribution similarities among the clients to form clusters. This is achieved by comparing the similarities of the principal angles between the client data subspaces spanned by those principal vectors. The approach provides a simple, yet effective clustered FL framework that addresses a broad range of data heterogeneity issues beyond simpler forms of Non-IIDness like label skews. Our clustered FL approach also enables convergence guarantees for non-convex objectives. Our code is available at https://github.com/MMorafah/PACFL.
SGC: A semi-supervised pipeline for gene clustering using self-training approach in gene co-expression networks
Aghaieabiane, Niloofar, Koutis, Ioannis
A widely used approach for extracting information from gene expression data employ the construction of a gene co-expression network and the subsequent application of algorithms that discover network structure. In particular, a common goal is the computational discovery of gene clusters, commonly called modules. When applied on a novel gene expression dataset, the quality of the computed modules can be evaluated automatically, using Gene Ontology enrichment, a method that measures the frequencies of Gene Ontology terms in the computed modules and evaluates their statistical likelihood. In this work we propose SGC a novel pipeline for gene clustering based on relatively recent seminal work in the mathematics of spectral network theory. SGC consists of multiple novel steps that enable the computation of highly enriched modules in an unsupervised manner. But unlike all existing frameworks, it further incorporates a novel step that leverages Gene Ontology information in a semi-supervised clustering method that further improves the quality of the computed modules. Comparing with already well-known existing frameworks, we show that SGC results in higher enrichment in real data. In particular, in 12 real gene expression datasets, SGC outperforms in all except one.