 Clustering


QuicK-means: Acceleration of K-means by learning a fast transform

arXiv.org Machine Learning

K-means -- and the celebrated Lloyd algorithm -- is more than the clustering method it was originally designed to be. It has indeed proven pivotal in helping to increase the speed of many machine learning and data analysis techniques such as indexing, nearest-neighbor search and prediction, and data compression; its beneficial use has been shown to carry over to the acceleration of kernel machines (when using the Nyström method). Here, we propose a fast extension of K-means, dubbed QuicK-means, that rests on the idea of expressing the matrix of the $K$ centroids as a product of sparse matrices, a feat made possible by recent results devoted to finding approximations of matrices as a product of sparse factors. Using such a decomposition squashes the complexity of the matrix-vector product between the factorized $K \times D$ centroid matrix $\mathbf{U}$ and any vector from $\mathcal{O}(K D)$ to $\mathcal{O}(A \log A+B)$, with $A=\min (K, D)$ and $B=\max (K, D)$, where $D$ is the dimension of the training data. This drastic computational saving has a direct impact on the assignment of a point to a cluster, meaning that it is not only tangible at prediction time, but also at training time, provided the factorization procedure is performed during Lloyd's algorithm. We precisely show that resorting to a factorization step at each iteration does not impair the convergence of the optimization scheme and that, depending on the context, it may entail a reduction of the training time. Finally, we provide discussions and numerical simulations that show the versatility of our computationally efficient QuicK-means algorithm.
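To make the cost argument concrete, here is a minimal NumPy sketch of plain Lloyd's algorithm (not QuicK-means itself). It shows that the assignment step reduces to the product between the data and the dense $K \times D$ centroid matrix, which is the $\mathcal{O}(KD)$-per-point operation the abstract proposes to replace with a product of sparse factors. The function names `assign_clusters` and `lloyd` are illustrative assumptions, not names from the paper.

```python
import numpy as np

def assign_clusters(X, U):
    """Assign each row of X (N x D) to its nearest row of U (K x D)."""
    # ||x - u_k||^2 = ||x||^2 - 2 x.u_k + ||u_k||^2, and ||x||^2 does not
    # change the argmin over k, so only the cross term and centroid norms matter.
    cross = X @ U.T                         # N x K: the dense O(K D)-per-point product
    centroid_norms = np.sum(U * U, axis=1)  # K squared norms
    return np.argmin(centroid_norms - 2.0 * cross, axis=1)

def lloyd(X, K, n_iter=20, seed=0):
    """Plain Lloyd iterations with a dense centroid matrix (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from K distinct data points.
    U = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign_clusters(X, U)
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:            # leave empty clusters where they are
                U[k] = members.mean(axis=0)
    return U, assign_clusters(X, U)
```

In this sketch, any speed-up of `X @ U.T` carries over to both training and prediction, which is the point the abstract makes about factorizing $\mathbf{U}$.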


Identification of Pediatric Sepsis Subphenotypes for Enhanced Machine Learning Predictive Performance: A Latent Profile Analysis

arXiv.org Machine Learning

Background: While machine learning (ML) models are rapidly emerging as promising screening tools in critical care medicine, the identification of homogeneous subphenotypes within populations with heterogeneous conditions such as pediatric sepsis may facilitate attainment of high predictive performance of these prognostic algorithms. This study aimed to identify subphenotypes of pediatric sepsis and demonstrate the potential value of partitioned data/subtyping-based training. Methods: This was a retrospective study of clinical data extracted from medical records of 6,446 pediatric patients who were admitted to a major hospital system in the DC area. Vitals and labs associated with patients meeting the diagnostic criteria for sepsis were used to perform latent profile analysis. Modern ML algorithms were used to explore the predictive performance benefits of reduced training data heterogeneity via label profiling. Results: In total 134 (2.1%) patients met the diagnostic criteria for sepsis in this cohort and latent profile analysis identified four profiles/subphenotypes of pediatric sepsis. Profiles 1 and 3 had the lowest mortality and included pediatric patients from different age groups. Profile 2 was characterized by respiratory dysfunction; profile 4 by neurological dysfunction and the highest mortality rate (22.2%). Machine learning experiments comparing the predictive performance of models derived without training data profiling against profile-targeted models suggest that a statistically significant improvement in predictive performance can be obtained. For example, the area under the ROC curve (AUC) obtained to predict profile 4 with 24-hour data (AUC = .998, p < .0001) compared favorably with the AUC obtained from the model considering all profiles as a single homogeneous group (AUC = .918) with 24-hour data.


Estimation of Spectral Clustering Hyper Parameters

arXiv.org Machine Learning

Robust automation of analysis procedures capable of handling diverse data sets is critical for high data throughput experiments at the Linac Coherent Light Source (LCLS). One challenge encountered in this process is determining the number of clusters required for the execution of conventional clustering algorithms. It is demonstrated here that bi-cross validation of the inverted and regularized Laplacian, used in the spectral clustering algorithm, yields a robust minimum at the predicted number of clusters and kernel hyper parameters. These results indicate that the process of estimating the number of clusters should not be divorced from the process of estimating other hyper parameters. Applying this method to LCLS X-ray scattering data demonstrates the ability to identify clusters of dropped shots without manually setting boundaries on detector fluence and provides a path towards identifying rare events.


Tracking Behavioral Patterns among Students in an Online Educational System

arXiv.org Machine Learning

Analysis of log data generated by online educational systems is an essential task for improving these systems and increasing our understanding of how students learn. In this study we investigate previously unseen data from Clio Online, the largest provider of digital learning content for primary schools in Denmark. We consider data for 14,810 students with 3 million sessions in the period 2015-2017. We analyze student activity in periods of one week. By using non-negative matrix factorization techniques, we obtain soft clusterings, revealing dependencies among time of day, subject, activity type, activity complexity (measured by Bloom's taxonomy), and performance. Furthermore, our method allows for tracking behavioral changes of individual students over time, as well as general behavioral changes in the educational system. Based on the results, we give suggestions for behavioral changes, in order to optimize the learning experience and improve performance.
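As a rough illustration of how a non-negative matrix factorization can yield soft clusterings, the sketch below uses the standard multiplicative updates for the Frobenius objective. The abstract does not say which NMF variant the authors use, so this choice, the function name `nmf_soft_clusters`, and the reading of rows as student-weeks and columns as activity features are all assumptions made for the example.

```python
import numpy as np

def nmf_soft_clusters(V, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (rows x features) as W @ H.

    Returns W, H, and row-normalised W, which can be read as soft
    cluster memberships for each row. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    # Strictly positive random initialisation keeps the updates well defined.
    W = rng.random((n_rows, n_components)) + eps
    H = rng.random((n_components, n_cols)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the squared Frobenius error ||V - W H||^2.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Normalise each row of W so its entries sum to one (soft memberships).
    row_sums = np.maximum(W.sum(axis=1, keepdims=True), eps)
    return W, H, W / row_sums
```

Each row of the returned membership matrix gives the relative weight of every component for that row, and each row of `H` describes a component as a non-negative profile over the features.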


N2D: (Not Too) Deep Clustering via Clustering the Local Manifold of an Autoencoded Embedding

arXiv.org Machine Learning

Deep clustering has increasingly been demonstrating superiority over conventional shallow clustering algorithms. Deep clustering algorithms usually combine representation learning with deep neural networks to achieve this performance, typically optimizing a clustering and non-clustering loss. In such cases, an autoencoder is typically connected with a clustering network, and the final clustering is jointly learned by both the autoencoder and clustering network. Instead, we propose to learn an autoencoded embedding and then search this further for the underlying manifold. We study a number of local and global manifold learning methods on both the raw data and autoencoded embedding, concluding that UMAP in our framework is able to find the best clusterable manifold of the embedding. This suggests that local manifold learning on an autoencoded embedding is effective for discovering higher quality clusters. We quantitatively show across a range of image and time-series datasets that our method has competitive performance against the latest deep clustering algorithms, including outperforming current state-of-the-art on several. We postulate that these results show a promising research direction for deep clustering. Clustering is a fundamental pillar of unsupervised machine learning. It is widely used in a range of tasks across disciplines and well-known algorithms such as k-means have found success in many applications.


Fuzzy C-Means Clustering and Sonification of HRV Features

arXiv.org Machine Learning

Linear and non-linear measures of heart rate variability (HRV) are widely investigated as non-invasive indicators of health. Stress has a profound impact on heart rate, and different meditation techniques have been found to modulate heartbeat rhythm. This paper aims to explore the process of identifying appropriate metrics from HRV analysis for sonification. Sonification is a type of auditory display involving the process of mapping data to acoustic parameters. This work explores the use of auditory display in aiding the analysis of HRV leveraged by unsupervised machine learning techniques. Unsupervised clustering helps select the appropriate features to improve the sonification interpretability. Vocal synthesis sonification techniques are employed to increase comprehension and learnability of the processed data displayed through sound. These analyses are early steps in building a real-time sound-based biofeedback training system.


Robust and Efficient Fuzzy C-Means Clustering Constrained on Flexible Sparsity

arXiv.org Machine Learning

Clustering is an effective technique in data mining to group a set of objects in terms of some attributes. Theoretical analyses and extensive experiments on several public datasets demonstrate the effectiveness and rationality of our proposed REFCMFS method. As a fundamental problem in machine learning, clustering is widely used in many fields, such as network data (including Protein-Protein Interaction Networks [1], Road Networks [2], Geo-Social Network [3]), medical diagnosis [4], biological data analysis [5], environmental chemistry [6] and so on. K-Means clustering is one of the most popular techniques because of its simplicity and effectiveness: it randomly initializes the cluster centroids, assigns each sample to its nearest cluster, and then updates the cluster centroids iteratively to partition a dataset into subsets. Over the past years, many modified versions of K-Means algorithms have been proposed, such as K-Means based Consensus clustering [7], Optimized Cartesian K-Means [8], Group K-Means [9] and so on. Jinglin Xu and Junwei Han were with the School of Automation, Northwestern Polytechnical University, Xi'an 710072, Shaanxi, China. Feiping Nie is with School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, Shaanxi, China. Xuelong Li is with School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, Shaanxi, China.


Regression on imperfect class labels derived by unsupervised clustering

arXiv.org Machine Learning

In biomarker studies it is popular to perform an unsupervised clustering of high-dimensional variables like genome-wide screens of SNPs, gene expressions, and protein data and regress, for example, treatment response, patient recorded outcome measures, time to disease progression, or overall survival on these potentially mislabelled clusters. It is well-known from the statistical literature that errors in continuous and categorical covariates can lead to loss of important information about effects on outcome (Carroll et al., 2006). However, to our surprise this is often ignored when regressing outcome on classes identified by unsupervised learning, which might lead to important clinical effect measures being overlooked (Alizadeh et al., 2000; Veer et al., 2002; Guinney et al., 2015; Zhan et al., 2006; Broyl et al., 2010). We suggest casting the problem as a covariate misclassification problem. This leaves us with a concourse of possible modelling and analysis options; see for example the book by Carroll et al. (2006) or the recent review by Brakenhoff et al. (2018).


Analyzing the Fine Structure of Distributions

arXiv.org Machine Learning

One aim of data mining is the identification of interesting structures in data. Basic properties of the empirical distribution, such as skewness and possible clipping, i.e., hard limits in value ranges, need to be assessed. Of particular interest is the question of whether the data originates from one process or contains subsets related to different states of the data-producing process. Data visualization tools should deliver a sensitive picture of the univariate probability density distribution (PDF) for each feature. Visualization tools for PDFs are typically kernel density estimates and range from the classical histogram to modern tools like bean or violin plots. Conventional methods have difficulties in visualizing the PDF in the case of uniform, multimodal, skewed and clipped data if density estimation parameters remain in a default setting. As a consequence, a new visualization tool called Mirrored Density plot (MD plot) is proposed, which is particularly designed to discover interesting structures in continuous features. The MD plot does not require any adjustment of density estimation parameters, which makes its usage compelling for non-experts. The visualization tools are evaluated in comparison to statistical tests for the typical challenges of explorative distribution analysis. The results are presented on bimodal Gaussian and skewed distributions as well as several features with published PDFs. In exploratory data analysis of 12 features describing the quarterly financial statements, when statistical testing becomes a demanding task, only the MD plots can identify the structure of their PDFs. Overall, the MD plot can outperform the methods mentioned above.


Introduction to Image Segmentation with K-Means clustering

#artificialintelligence

Image segmentation is an important step in image processing, and it appears everywhere we want to analyze what is inside an image. For example, if we seek to find whether there is a chair or a person in an indoor image, we may need image segmentation to separate the objects and analyze each one individually to check what it is. Image segmentation usually serves as pre-processing before pattern recognition, feature extraction, and compression of the image. Image segmentation is the classification of an image into different groups. Much research has been done in the area of image segmentation using clustering. There are different methods, and one of the most popular is the K-Means clustering algorithm.
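The idea can be sketched in a few lines of NumPy: treat every pixel as a point in color space, cluster the pixels with K-Means, and replace each pixel by the color of its cluster centroid. This is a minimal illustration under stated assumptions, not the article's own code; the function name `segment_image` and its parameters are hypothetical, and the image is assumed to be an array of shape height x width x channels.

```python
import numpy as np

def _nearest_centroid(pixels, centroids):
    """Index of the closest centroid (squared Euclidean) for every pixel."""
    dists = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def segment_image(image, k, n_iter=10, seed=0):
    """Cluster the pixel colors of an H x W x C image into k groups.

    Returns the H x W label map and the image repainted with centroid colors.
    """
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)   # one row per pixel
    rng = np.random.default_rng(seed)
    # Start from k pixels chosen at random as initial centroids.
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = _nearest_centroid(pixels, centroids)
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:                  # skip empty clusters
                centroids[j] = members.mean(axis=0)
    labels = _nearest_centroid(pixels, centroids) # final assignment
    segmented = centroids[labels].reshape(h, w, c)
    return labels.reshape(h, w), segmented
```

For example, calling `segment_image(img, k=3)` on an RGB array `img` returns a label map with values in `{0, 1, 2}` and a copy of the image reduced to three colors. The pairwise distance array has one entry per pixel per cluster, so for large images a real implementation would process pixels in batches.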