cluster analysis

Mean shift cluster recognition method implementation in the nested sampling algorithm Machine Learning

Nested sampling is an efficient algorithm for the calculation of the Bayesian evidence and posterior parameter probability distributions. It is based on the step-by-step exploration of the parameter space by Monte Carlo sampling with a series of sets of values called live points that evolve towards the region of interest, i.e. where the likelihood function is maximal. In the presence of several local likelihood maxima, the algorithm converges with difficulty. Some systematic errors can also be introduced by unexplored parameter volume regions. In order to avoid this, different methods are proposed in the literature for an efficient search of new live points, even in the presence of local maxima. Here we present a new solution based on the mean shift cluster recognition method implemented in a random walk search algorithm. The cluster recognition is integrated within the Bayesian analysis program NestedFit. It is tested with the analysis of some difficult cases. Compared to the analysis results without cluster recognition, the computation time is considerably reduced. At the same time, the entire parameter space is efficiently explored, which translates into a smaller uncertainty of the extracted value of the Bayesian evidence.
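
As a rough illustration of the mean shift idea named in this abstract, the sketch below groups a set of points (think of live points) by repeatedly moving each one to the mean of its neighbours. It is not the NestedFit implementation: the flat kernel, the `bandwidth` parameter, the merge radius of `bandwidth / 2`, and the function name `mean_shift` are all assumptions made for this example.

```python
import math

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift (illustrative sketch, not NestedFit's code).

    Each point is moved to the mean of the data points lying within
    `bandwidth` of it until it stops moving. Converged positions that
    end up close together are merged into one mode (cluster centre).
    """
    converged = []
    for start in points:
        p = list(start)
        for _ in range(max_iter):
            neighbours = [q for q in points if math.dist(p, q) <= bandwidth]
            if not neighbours:  # defensive guard; not expected in practice
                break
            new_p = [sum(coord) / len(neighbours) for coord in zip(*neighbours)]
            moved = math.dist(p, new_p)
            p = new_p
            if moved < tol:
                break
        converged.append(p)

    # Assumption: positions closer than bandwidth / 2 belong to the same mode.
    modes, labels = [], []
    for p in converged:
        for k, m in enumerate(modes):
            if math.dist(p, m) < bandwidth / 2:
                labels.append(k)
                break
        else:
            modes.append(p)
            labels.append(len(modes) - 1)
    return modes, labels

# Two well-separated groups of points, standing in for two likelihood maxima.
live_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
               (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
modes, labels = mean_shift(live_points, bandwidth=1.0)
print(len(modes), labels)
```

In a sampler, cluster labels of this kind could be used to restrict the search for a new live point to one cluster at a time; how NestedFit actually uses them is described in the paper itself.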

Fraud detection: the problem, solutions and tools


Fraud is a billion-dollar business. There are many formal definitions, but essentially fraud is the "art" and crime of deceiving and scamming people in their financial transactions. Fraud has existed throughout human history, but in this age of digital technology the strategy, extent and magnitude of financial fraud are becoming wide-ranging, from credit card transactions to health benefits to insurance claims. Fraudsters are also getting super creative. Who has never received an email from a Nigerian royal widow looking for someone trusted to hand over large sums of her inheritance? No wonder fraud is a big deal.

What Artificial Intelligence Says About the Perfect Running Stride


The physiologist and coach Jack Daniels once filmed a bunch of runners in stride, then showed the footage to coaches and biomechanists to see if they could eyeball who was the most efficient. "They couldn't tell," Daniels later recalled. "No way at all." Famously awkward-looking runners like Paula Radcliffe and Alberto Salazar sometimes turn out to be extraordinarily efficient. Smooth-striding beauties sometimes finish at the back of the pack. The act of running, it turns out, is surprisingly complicated.

Machine Learning Interview Questions And Answers


Machine learning (ML) is a rising field. It offers many interesting and well-paid jobs and opportunities. Each of these and some other items might be touched on in an ML interview. There is a large number of possible questions and topics. This article presents 12 general questions (with brief answers) appropriate mainly for beginners and intermediates.

Clusters in Explanation Space: Inferring disease subtypes from model explanations Machine Learning

Identification of disease subtypes and corresponding biomarkers can substantially improve clinical diagnosis and treatment selection. Discovering these subtypes in noisy, high-dimensional biomedical data is often impossible for humans and challenging for machines. We introduce a new approach to facilitate the discovery of disease subtypes: instead of analyzing the original data, we train a diagnostic classifier (healthy vs. diseased) and extract instance-wise explanations for the classifier's decisions. The distribution of instances in the explanation space of our diagnostic classifier amplifies the different reasons for belonging to the same class, resulting in a representation that is uniquely useful for discovering latent subtypes. We compare our ability to recover subtypes via cluster analysis on model explanations to classical cluster analysis on the original data. In multiple datasets with known ground-truth subclasses, most compellingly on UK Biobank brain imaging data and transcriptome data from the Cancer Genome Atlas, we show that cluster analysis on model explanations substantially outperforms the classical approach. While we believe clustering in explanation space to be particularly valuable for inferring disease subtypes, the method is more general and applicable to any kind of subtype identification.

Machine Learning – Introduction to Unsupervised Learning Vinod Sharma's Blog


Unsupervised learning helps to find hidden jewels in data by grouping similar things together. The data have no target attribute: the algorithm takes training examples described by their attributes/features alone. In this post, I have summarised my whole upcoming book "Unsupervised Learning – The Unlabelled Data Treasure" on one page. This one-page guide gives a high-level overview of unsupervised learning.

Characterization and Development of Average Silhouette Width Clustering Machine Learning

The purpose of this paper is to introduce a new clustering methodology. The paper is divided into three parts. In the first part we develop the axiomatic theory for the average silhouette width (ASW) index. There are different ways to investigate the quality and characteristics of clustering methods, such as validation indices using simulations and real data experiments, model-based theory, and non-model-based theory known as the axiomatic theory. In this work we not only take the empirical approach of validating clustering results through simulations, but also focus on the development of the axiomatic theory. In the second part we present a novel clustering methodology based on the optimization of the ASW index. We consider the problem of estimating the number of clusters and finding the clustering for this number simultaneously. Two algorithms are proposed. The proposed algorithms are evaluated against several partitioning and hierarchical clustering methods. An intensive empirical comparison of the different distance metrics on the various clustering methods is conducted. In the third part we consider two application domains: novel single-cell RNA sequencing datasets, and rainfall data used to cluster weather stations.
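
For readers unfamiliar with the index, the following minimal sketch computes the average silhouette width of a given clustering using the standard definition with Euclidean distance. It only evaluates the index; it is not either of the two optimization algorithms proposed in the paper. The function name and the convention that singleton clusters contribute a silhouette of 0 are assumptions of this example.

```python
import math

def average_silhouette_width(points, labels):
    """Average silhouette width (ASW) of a clustering, Euclidean distance.

    For point i: a = mean distance to the other members of its cluster,
    b = smallest mean distance to the members of any other cluster,
    silhouette s = (b - a) / max(a, b). The ASW is the mean of s over all points.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("ASW needs at least two clusters")

    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # assumed convention: a singleton contributes 0
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Four points on a line forming two obvious pairs.
pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(average_silhouette_width(pts, [0, 0, 1, 1]))  # natural grouping, close to 1
print(average_silhouette_width(pts, [0, 1, 0, 1]))  # poor grouping, negative
```

A higher ASW indicates compact, well-separated clusters, which is why comparing the value across candidate numbers of clusters is a common way to choose that number.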

Data Science and Finance


We are now living in the age of Data Science and Big Data, as the ubiquity and availability of large amounts of data plus advances in technology to store, process, and analyze such data have revolutionized ways of thinking about things and of doing business. If you take a look at your social media accounts and wonder how these outfits are able to anticipate the kind of content you like to consume, the answer is that data science and big data analytics are being harnessed to try to guess exactly that, and with very good results. Want to buy a book from your favorite online merchant and out pop some other suggested books that you never even thought about, but you buy them anyway thanks to the prompt? You guessed it, data science and data analytics had a hand in this as well. Even the CFA Institute, the organization that grants the Chartered Financial Analyst designation globally, has for years integrated data science and big data into its curriculum, with a major emphasis on finance as the domain expertise.

Clustering by the way of atomic fission Machine Learning

Cluster analysis, which focuses on the grouping and categorization of similar elements, is widely used in various fields of research. Inspired by the phenomenon of atomic fission, a novel density-based clustering algorithm called fission clustering (FC) is proposed in this paper. It focuses on mining the dense families of a dataset and uses the information in the distance matrix to split the dataset into subsets. When a dataset has a few points surrounding the dense families of clusters, a K-nearest-neighbors local density indicator is applied to distinguish and remove the points in sparse areas, so as to obtain a dense subset made up of the dense families of clusters. A number of frequently used datasets were used to test the performance of this clustering approach and to compare its results with those of other algorithms. The proposed algorithm is found to outperform the other algorithms in speed and accuracy.
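
The abstract does not define the K-nearest-neighbors local density indicator, so the sketch below uses one common choice as an assumption: the inverse of the mean distance to the k nearest neighbours. The function names, the `keep_fraction` parameter, and the rule of keeping the densest fraction of points are hypothetical and only illustrate how sparse points could be removed before clustering.

```python
import math

def knn_density(points, k):
    """Local density of each point, assumed here to be the inverse of the
    mean distance to its k nearest neighbours (requires k < len(points))."""
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn = sum(dists[:k]) / k
        densities.append(1.0 / mean_knn if mean_knn > 0 else math.inf)
    return densities

def dense_subset(points, k, keep_fraction=0.9):
    """Keep the densest `keep_fraction` of the points, in original order."""
    dens = knn_density(points, k)
    order = sorted(range(len(points)), key=lambda i: dens[i], reverse=True)
    n_keep = max(1, int(round(keep_fraction * len(points))))
    kept = sorted(order[:n_keep])
    return [points[i] for i in kept]

# Two tight groups plus one isolated point lying between them.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
        (2.5, 2.5)]
print(dense_subset(data, k=2, keep_fraction=0.85))
```

With these inputs the isolated point has by far the lowest density and is the one dropped, leaving the two dense groups for the subsequent splitting step.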

Unsupervised Temporal Clustering to Monitor the Performance of Alternative Fueling Infrastructure Machine Learning

Zero Emission Vehicles (ZEVs) play an important role in the decarbonization of the transportation sector. For wider adoption of ZEVs, providing a reliable infrastructure is critical. We present a machine learning approach that uses an unsupervised temporal clustering algorithm along with survey analysis to determine the performance and reliability of alternative fueling infrastructure. We illustrate this approach for the hydrogen fueling stations in California, but it can be generalized to other regions and fuels.