cluster analysis

Cluster Analysis


We are familiar with most of the supervised learning methods, for example linear regression, logistic regression, decision trees, SVM, and so on, where each input has an associated output or label. When a problem gives us inputs but no associated outputs or labels, the learning is known as unsupervised learning. One mechanism that we may use in this context is cluster analysis, or clustering. Definition 1: Cluster analysis is a multivariate statistical technique. It groups observations on the basis of some of the features or variables that describe them. Definition 2: In cluster analysis, the observations in a data set are divided into different groups, which is very useful.
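To make the idea of grouping observations by their features concrete, here is a minimal sketch of k-means (Lloyd's algorithm), one common clustering method. The excerpt above does not name a specific algorithm, so the choice of k-means, the function name `kmeans`, and the toy data are all assumptions made for illustration.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Illustrative Lloyd's algorithm on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct observations
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center (squared Euclidean).
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Toy data (hypothetical): two visibly separated groups, no labels supplied.
points = [(1.0, 1.1), (0.9, 1.0), (1.3, 0.8),
          (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centers = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary. Only the grouping (which observations share a label) carries meaning.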

A Mean Field Games model for finite mixtures of Bernoulli distributions

Finite mixture models are an important tool in the statistical analysis of data, for example in data clustering. The optimal parameters of a mixture model are usually computed by maximizing the log-likelihood functional via the Expectation-Maximization algorithm. We propose an alternative approach based on the theory of Mean Field Games, a class of differential games with an infinite number of agents. We show that the solution of a finite state space multi-population Mean Field Games system characterizes the critical points of the log-likelihood functional for a Bernoulli mixture. The approach is then generalized to mixture models of categorical distributions. Hence, the Mean Field Games approach provides a method to compute the parameters of the mixture model, and we show its application to some standard examples in cluster analysis.
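For orientation, the conventional route mentioned in the abstract (maximizing the log-likelihood with Expectation-Maximization) can be sketched as follows for a Bernoulli mixture. This is not the Mean Field Games method the paper proposes. It is a plain EM baseline, and the function name, the clamping constant `eps`, and the toy data are assumptions.

```python
import math
import random


def bernoulli_mixture_em(X, k, iters=50, seed=0, eps=1e-6):
    """Illustrative EM for a mixture of k multivariate Bernoulli components."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    weights = [1.0 / k] * k
    probs = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    history = []  # log-likelihood before each update
    for _ in range(iters):
        # E-step: responsibility of each component for each observation.
        resp, loglik = [], 0.0
        for x in X:
            joint = []
            for j in range(k):
                p = weights[j]
                for xi, pj in zip(x, probs[j]):
                    p *= pj if xi else (1.0 - pj)
                joint.append(p)
            total = sum(joint)
            loglik += math.log(total)
            resp.append([q / total for q in joint])
        history.append(loglik)
        # M-step: re-estimate mixing weights and success probabilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            if nj == 0:
                continue  # empty component: leave its parameters unchanged
            for i in range(d):
                p = sum(r[j] * x[i] for r, x in zip(resp, X)) / nj
                probs[j][i] = min(max(p, eps), 1.0 - eps)  # keep away from 0 and 1
    return weights, probs, history


# Toy binary data (hypothetical): rows resemble either 1100 or 0011.
X = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1],
     [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]]
weights, probs, history = bernoulli_mixture_em(X, k=2)
print(weights, history[0], history[-1])
```

EM only finds a critical point of the log-likelihood, and which one it reaches depends on the random initialization.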

Survival Cluster Analysis

Conventional survival analysis approaches estimate risk scores or individualized time-to-event distributions conditioned on covariates. In practice, there is often great population-level phenotypic heterogeneity, resulting from (unknown) subpopulations with diverse risk profiles or survival distributions. As a result, there is an unmet need in survival analysis for identifying subpopulations with distinct risk profiles, while jointly accounting for accurate individualized time-to-event predictions. An approach that addresses this need is likely to improve characterization of individual outcomes by leveraging regularities in subpopulations, thus accounting for population-level heterogeneity. In this paper, we propose a Bayesian nonparametrics approach that represents observations (subjects) in a clustered latent space, and encourages accurate time-to-event predictions and clusters (subpopulations) with distinct risk profiles. Experiments on real-world datasets show consistent improvements in predictive performance and interpretability relative to existing state-of-the-art survival analysis models.

Statistical power for cluster analysis

Cluster algorithms are gaining in popularity due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream programming languages and statistical software. While researchers can follow guidelines to choose the right algorithms, and to determine what constitutes convincing clustering, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we take a simulation approach to estimate power and classification accuracy for popular analysis pipelines. We systematically varied cluster size, number of clusters, number of different features between clusters, effect size within each different feature, and cluster covariance structure in generated datasets. We then subjected these datasets to common dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means; hierarchical agglomerative clustering with Ward linkage and Euclidean distance, or average linkage and cosine distance; HDBSCAN). Furthermore, we simulated additional datasets to explore the effect of sample size and cluster separation on statistical power and classification accuracy. We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power can be achieved with relatively small samples (N = 20 per subgroup), provided cluster separation is large (Δ = 4). Finally, we discuss whether fuzzy clustering (c-means) could provide a more parsimonious alternative for identifying separable multivariate normal distributions, particularly those with lower centroid separation.
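The simulation idea can be illustrated with a deliberately simplified sketch: draw two subgroups of N = 20 whose centroids sit Δ standard deviations apart, cluster them with 2-means, and average the classification accuracy over many simulated datasets. Everything here is an assumption made for illustration, including the single feature, the unit variances, and the function names. It estimates accuracy only, because the abstract does not spell out the decision rule used to define power.

```python
import random


def two_means_1d(values, iters=25):
    """2-means on scalars, initialised at the minimum and maximum value."""
    c0, c1 = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return labels


def simulated_accuracy(delta, n_per_group=20, n_sims=200, seed=0):
    """Mean clustering accuracy over simulated two-subgroup datasets."""
    rng = random.Random(seed)
    truth = [0] * n_per_group + [1] * n_per_group
    total = 0.0
    for _ in range(n_sims):
        values = ([rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
                  + [rng.gauss(delta, 1.0) for _ in range(n_per_group)])
        labels = two_means_1d(values)
        agree = sum(t == lab for t, lab in zip(truth, labels)) / len(truth)
        total += max(agree, 1.0 - agree)  # cluster numbering is arbitrary
    return total / n_sims


acc_large = simulated_accuracy(delta=4.0)
acc_small = simulated_accuracy(delta=0.5)
print(acc_large, acc_small)
```

Because accuracy is taken as the better of the two possible label matchings, it can never fall below 0.5, so values near 0.5 indicate that the subgroups were not recovered.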

A fast and efficient Modal EM algorithm for Gaussian mixtures

In the modal approach to clustering, clusters are defined as the local maxima of the underlying probability density function, where the latter can be estimated either non-parametrically or using finite mixture models. Thus, clusters are closely related to certain regions around the density modes, and every cluster corresponds to a bump of the density. The Modal EM algorithm is an iterative procedure that can identify the local maxima of any density function. In this contribution, we propose a fast and efficient Modal EM algorithm to be used when the density function is estimated through a finite mixture of Gaussian distributions with parsimonious component-covariance structures. After describing the procedure, we apply the proposed Modal EM algorithm on both simulated and real data examples, showing its high flexibility in several contexts.
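As a rough illustration of the modal approach, the sketch below runs the basic Modal EM fixed-point iteration on a one-dimensional Gaussian mixture whose parameters are already known. It is the generic textbook update, not the fast variant proposed in the paper, and the function names and the example mixture are assumptions.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def modal_em(x, weights, means, sds, tol=1e-10, max_iter=1000):
    """Climb from x to a local maximum of a 1-D Gaussian mixture density."""
    for _ in range(max_iter):
        # E-step: posterior probability of each component at the current x.
        joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
        total = sum(joint)
        post = [j / total for j in joint]
        # M-step: maximise sum_k post_k * log N(x | mean_k, sd_k^2) over x,
        # which gives a precision-weighted average of the component means.
        num = sum(p * m / s ** 2 for p, m, s in zip(post, means, sds))
        den = sum(p / s ** 2 for p, s in zip(post, sds))
        x_new = num / den
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


# Hypothetical mixture with two well-separated bumps.
weights, means, sds = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]
mode_a = modal_em(1.0, weights, means, sds)
mode_b = modal_em(4.0, weights, means, sds)
print(mode_a, mode_b)
```

Observations whose iterations end at the same mode would be placed in the same cluster.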

Mean shift cluster recognition method implementation in the nested sampling algorithm

Nested sampling is an efficient algorithm for the calculation of the Bayesian evidence and posterior parameter probability distributions. It is based on the step-by-step exploration of the parameter space by Monte Carlo sampling with a series of sets of values called live points that evolve towards the region of interest, i.e. where the likelihood function is maximal. In the presence of several local likelihood maxima, the algorithm converges with difficulty. Some systematic errors can also be introduced by unexplored parameter volume regions. In order to avoid this, different methods are proposed in the literature for an efficient search of new live points, even in the presence of local maxima. Here we present a new solution based on the mean shift cluster recognition method implemented in a random walk search algorithm. The clustering recognition is integrated within the Bayesian analysis program NestedFit. It is tested with the analysis of some difficult cases. Compared to the analysis results without cluster recognition, the computation time is considerably reduced. At the same time, the entire parameter space is efficiently explored, which translates into a smaller uncertainty of the extracted value of the Bayesian evidence.
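The mean shift step on its own can be sketched as follows: every point is moved repeatedly to the kernel-weighted mean of all points until it settles on a density mode, and points that settle on the same mode form one cluster. This is a generic one-dimensional version with a Gaussian kernel. It is not taken from NestedFit, and the function name, the bandwidth, the merge threshold, and the stand-in "live points" are assumptions.

```python
import math


def mean_shift_1d(points, bandwidth, iters=100, tol=1e-8):
    """Illustrative Gaussian-kernel mean shift on scalar points."""
    modes = []
    for start in points:
        x = start
        for _ in range(iters):
            w = [math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points]
            x_new = sum(wi * p for wi, p in zip(w, points)) / sum(w)
            if abs(x_new - x) < tol:
                break
            x = x_new
        modes.append(x)
    # Points whose modes lie within half a bandwidth share a cluster
    # (the merge threshold is an arbitrary choice for this sketch).
    labels, centres = [], []
    for m in modes:
        for idx, c in enumerate(centres):
            if abs(m - c) < bandwidth / 2:
                labels.append(idx)
                break
        else:
            centres.append(m)
            labels.append(len(centres) - 1)
    return labels, centres


# Hypothetical live points gathered around two local likelihood maxima.
live_points = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
labels, centres = mean_shift_1d(live_points, bandwidth=0.5)
print(labels, centres)
```

Unlike k-means, the number of clusters is not fixed in advance. It follows from the bandwidth, which controls how finely the density is resolved.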

Fraud detection: the problem, solutions and tools


Fraud is a billion-dollar business. There are many formal definitions, but essentially fraud is the "art" and crime of deceiving and scamming people in their financial transactions. Fraud has existed throughout human history, but in this age of digital technology the strategy, extent and magnitude of financial fraud are becoming wide-ranging, from credit card transactions to health benefits to insurance claims. Fraudsters are also getting super creative. Who has never received an email from a Nigerian royal widow looking for someone trusted to hand over large sums of her inheritance? No wonder fraud is a big deal.

What Artificial Intelligence Says About the Perfect Running Stride


The physiologist and coach Jack Daniels once filmed a bunch of runners in stride, then showed the footage to coaches and biomechanists to see if they could eyeball who was the most efficient. "They couldn't tell," Daniels later recalled. "No way at all." Famously awkward-looking runners like Paula Radcliffe and Alberto Salazar sometimes turn out to be extraordinarily efficient. Smooth-striding beauties sometimes finish at the back of the pack. The act of running, it turns out, is surprisingly complicated.

Machine Learning Interview Questions And Answers


Machine learning (ML) is a rising field. It offers many interesting and well-paid jobs and opportunities. Each of these and some other items might be touched on in an ML interview. There is a large number of possible questions and topics. This article presents 12 general questions (with brief answers) appropriate mainly for beginners and intermediates.

Clusters in Explanation Space: Inferring disease subtypes from model explanations

Identification of disease subtypes and corresponding biomarkers can substantially improve clinical diagnosis and treatment selection. Discovering these subtypes in noisy, high dimensional biomedical data is often impossible for humans and challenging for machines. We introduce a new approach to facilitate the discovery of disease subtypes: Instead of analyzing the original data, we train a diagnostic classifier (healthy vs. diseased) and extract instance-wise explanations for the classifier's decisions. The distribution of instances in the explanation space of our diagnostic classifier amplifies the different reasons for belonging to the same class, resulting in a representation that is uniquely useful for discovering latent subtypes. We compare our ability to recover subtypes via cluster analysis on model explanations to classical cluster analysis on the original data. In multiple datasets with known ground-truth subclasses, most compellingly on UK Biobank brain imaging data and transcriptome data from the Cancer Genome Atlas, we show that cluster analysis on model explanations substantially outperforms the classical approach. While we believe clustering in explanation space to be particularly valuable for inferring disease subtypes, the method is more general and applicable to any kind of subtype identification.