 Clustering


Hierarchical Clustering Techniques using R

#artificialintelligence

The idea behind hierarchical cluster analysis is to show which of a (potentially large) set of samples are most similar to one another, and to group these similar samples in the same limb of a tree. Each sample can be thought of as sitting in an m-dimensional space, defined by the m variables (columns) in the dataframe. We define similarity on the basis of the distance between two samples in this m-dimensional space. Several different distance measures could be used, but the default is Euclidean distance, which is used to work out the distance from every sample to every other sample. This quantitative dissimilarity structure of the data is stored in a matrix produced by the "dist" function.
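The distance-then-group idea described above can be sketched in a few lines. The excerpt refers to R's "dist" function; the sketch below is a plain Python stand-in rather than the article's code. The names `dist_matrix` and `single_linkage`, the toy samples, and the choice of single linkage are all assumptions made for illustration, since the excerpt does not say which linkage rule it uses.

```python
import math


def dist_matrix(samples):
    """Euclidean distance from every sample to every other sample."""
    n = len(samples)
    return [[math.dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]


def single_linkage(samples, k):
    """Repeatedly merge the two closest clusters until only k remain.

    Single linkage (distance between the nearest members) is an assumption
    chosen for brevity, not something stated in the source.
    """
    d = dist_matrix(samples)
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gap = min(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Hypothetical samples in a 2-dimensional space (m = 2).
samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(single_linkage(samples, 3))
```

Each merge corresponds to joining two limbs of the tree; stopping at k clusters is equivalent to cutting the tree at a fixed number of branches.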


Tensor clustering with algebraic constraints gives interpretable groups of crosstalk mechanisms in breast cancer

arXiv.org Machine Learning

Multi-dimensional datasets are now prevalent across the sciences; their ubiquity and importance will only continue to grow [1-4]. The analysis of data demands methods that preserve multidimensional structures, and that exploit them. We introduce a versatile data clustering framework based on tensors (high dimensional arrays) and algebra to analyze multidimensional datasets. One key feature of this method is that it can incorporate general, application-specific constraints on the composition of a cluster, and is guaranteed to find optimal partitions. The flexibility of the method allows it to be used directly on a dataset (i.e., as a standalone clustering tool), or in combination with other clustering methods. We apply our method to an extensive set of timecourse measurements of the activation levels of the mitogen-activated protein kinase (MAPK) and phosphoinositide 3-kinase (PI3K) pathways that are involved in cellular decisions and fates [10-13], and are known to dysfunction in cancer [10-13, 16]. The key signaling proteins and subtype responses in breast cancer cells are known; however, among genetically diverse cell lines the dysfunction varies and is not well understood [1, 15, 16]. Our objective is to find groups of cell lines whose signal transduction networks have similar dynamics. A high similarity suggests that the cell lines share pathway features that can be relevant for the responses to the ligands.


How to use machine learning to identify "good" customers vs "bad" customers - BDO Canada - IT Solutions

#artificialintelligence

Good profitable customers rarely become unprofitable. It is more likely that they were unprofitable from the onset. Determining an approach to define customer value can be a complex decision. Traditionally, we use gross margin in identifying good and bad customers. For example, if your overhead costs are 25% of gross revenue, a good customer is anyone with a gross margin over 25%.
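The gross-margin rule in the example above can be written out as a short calculation. This is a minimal sketch, not the article's method: the function names, the customer figures, and the definition of gross margin as (revenue minus cost of goods) divided by revenue are assumptions for illustration.

```python
OVERHEAD_RATE = 0.25  # overhead as a share of gross revenue, taken from the example


def gross_margin(revenue, cost_of_goods):
    """Gross margin as a fraction of revenue (assumed definition)."""
    return (revenue - cost_of_goods) / revenue


def label_customer(revenue, cost_of_goods, overhead_rate=OVERHEAD_RATE):
    """A customer is 'good' when gross margin exceeds the overhead rate."""
    return "good" if gross_margin(revenue, cost_of_goods) > overhead_rate else "bad"


# Hypothetical customers: (revenue, cost of goods sold).
customers = {"A": (1000.0, 600.0), "B": (1000.0, 800.0)}
labels = {name: label_customer(*figures) for name, figures in customers.items()}
print(labels)
```

Customer A has a 40% margin and clears the 25% threshold, while customer B at 20% does not.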


Graying the black box: Understanding DQNs

arXiv.org Artificial Intelligence

In recent years there has been growing interest in using deep representations for reinforcement learning. In this paper, we present a methodology and tools to analyze Deep Q-networks (DQNs) in a non-blind manner. Moreover, we propose a new model, the Semi Aggregated Markov Decision Process (SAMDP), and an algorithm that learns it automatically. The SAMDP model allows us to identify spatio-temporal abstractions directly from features and may be used as a sub-goal detector in future work. Using our tools we reveal that the features learned by DQNs aggregate the state space in a hierarchical fashion, explaining their success. Moreover, we are able to understand and describe the policies learned by DQNs for three different Atari2600 games and suggest ways to interpret, debug and optimize deep neural networks in reinforcement learning.


How Machines Make Sense of Big Data: an Introduction to Clustering Algorithms

#artificialintelligence

While there's not necessarily a "correct" answer here, it's most likely you split the bugs into four clusters. That wasn't too bad, was it? You could probably do the same with twice as many bugs, right? If you had a bit of time to spare -- or a passion for entomology -- you could probably even do the same with a hundred bugs. For a machine though, grouping ten objects into however many meaningful clusters is no small task, thanks to a mind-bending branch of maths called combinatorics, which tells us that there are 115,975 different possible ways you could have grouped those ten insects together. Had there been twenty bugs, there would have been over fifty trillion possible ways of clustering them. With a hundred bugs, there'd be many times more solutions than there are particles in the known universe. In fact, there are more than four million billion googol solutions (what's a googol?).
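The counts quoted above are Bell numbers: the number of ways to split n distinct objects into any number of non-empty groups. As a small check, they can be computed with the Bell triangle; the function name `bell` is just a label chosen for this sketch.

```python
def bell(n):
    """Number of ways to partition n distinct objects into non-empty groups."""
    row = [1]  # row 0 of the Bell triangle
    for _ in range(n):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry to its left and the one above that.
        nxt = [row[-1]]
        for value in row:
            nxt.append(nxt[-1] + value)
        row = nxt
    return row[0]


print(bell(10))  # ten bugs: 115975
print(bell(20))  # twenty bugs: 51724158235372
```

Python integers have arbitrary precision, so `bell(100)` can also be evaluated directly to compare against the "four million billion googol" figure.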


Machine Learning With Python - Hierarchical Clustering Advantages & Disadvantages

#artificialintelligence

Enroll in the course for free at: https://bigdatauniversity.com/courses... Machine Learning can be an incredibly beneficial tool to uncover hidden insights and predict future trends. This free Machine Learning with Python course will give you all the tools you need to get started with supervised and unsupervised learning. This #MachineLearning with #Python course dives into the basics of machine learning using an approachable, and well-known, programming language. You'll learn about Supervised vs Unsupervised Learning, look into how Statistical Modeling relates to Machine Learning, and do a comparison of each. Look at real-life examples of Machine learning and how it affects society in ways you may not have guessed!


A Quasi-Bayesian Perspective to Online Clustering

arXiv.org Machine Learning

When faced with high frequency streams of data, clustering raises theoretical and algorithmic pitfalls. We introduce a new and adaptive online clustering algorithm relying on a quasi-Bayesian approach, with a dynamic (i.e., time-dependent) estimation of the (unknown and changing) number of clusters. We prove that our approach is supported by minimax regret bounds. We also provide an RJMCMC-flavored implementation (called PACBO) for which we give a convergence guarantee. Finally, numerical experiments illustrate the potential of our procedure.


Fast Spectral Clustering Using Autoencoders and Landmarks

arXiv.org Machine Learning

In this paper, we introduce an algorithm for performing spectral clustering efficiently. Spectral clustering is a powerful clustering algorithm that suffers from high computational complexity, due to eigen decomposition. In this work, we first build the adjacency matrix of the corresponding graph of the dataset. To build this matrix, we only consider a limited number of points, called landmarks, and compute the similarity of all data points with the landmarks. Then, we present a definition of the Laplacian matrix of the graph that enables us to perform eigen decomposition efficiently, using a deep autoencoder. The overall complexity of the algorithm for eigen decomposition is $O(np)$, where $n$ is the number of data points and $p$ is the number of landmarks. Finally, we evaluate the performance of the algorithm in different experiments.
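To illustrate only the landmark step described above, the sketch below builds an n-by-p similarity matrix between all data points and p landmarks, which is where the $O(np)$ size comes from. It is not the paper's implementation: the Gaussian kernel, the uniform random choice of landmarks, the `sigma` parameter, and the function name are all assumptions, and the autoencoder stage is omitted.

```python
import math
import random


def landmark_similarities(points, p, sigma=1.0, seed=0):
    """Return an n-by-p matrix of similarities between points and p landmarks.

    Landmarks are sampled uniformly at random and similarity is a Gaussian
    kernel; both choices are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    landmarks = rng.sample(points, p)
    return [
        [math.exp(-math.dist(x, lm) ** 2 / (2 * sigma ** 2)) for lm in landmarks]
        for x in points
    ]


# Hypothetical 2-D data: n = 6 points, p = 2 landmarks.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
sim = landmark_similarities(points, p=2)
print(len(sim), len(sim[0]))
```

Storing n times p entries instead of n squared is what keeps the cost linear in the number of data points when p is fixed.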


DIMM-SC: A Dirichlet mixture model for clustering droplet-based single cell transcriptomic data

arXiv.org Machine Learning

Motivation: Single cell transcriptome sequencing (scRNA-Seq) has become a revolutionary tool to study cellular and molecular processes at single cell resolution. Among existing technologies, the recently developed droplet-based platform enables efficient parallel processing of thousands of single cells with direct counting of transcript copies using Unique Molecular Identifier (UMI). Despite the technology advances, statistical methods and computational tools are still lacking for analyzing droplet-based scRNA-Seq data. Particularly, model-based approaches for clustering large-scale single cell transcriptomic data are still under-explored. Methods: We developed DIMM-SC, a Dirichlet Mixture Model for clustering droplet-based Single Cell transcriptomic data. This approach explicitly models UMI count data from scRNA-Seq experiments and characterizes variations across different cell clusters via a Dirichlet mixture prior. An expectation-maximization algorithm is used for parameter inference. Results: We performed comprehensive simulations to evaluate DIMM-SC and compared it with existing clustering methods such as K-means, CellTree and Seurat. In addition, we analyzed public scRNA-Seq datasets with known cluster labels and in-house scRNA-Seq datasets from a study of systemic sclerosis with prior biological knowledge to benchmark and validate DIMM-SC. Both simulation studies and real data applications demonstrated that overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability compared to other existing clustering methods. More importantly, as a model-based approach, DIMM-SC is able to quantify the clustering uncertainty for each single cell, facilitating rigorous statistical inference and biological interpretations, which are typically unavailable from existing clustering methods.


Massive Data Clustering in Moderate Dimensions from the Dual Spaces of Observation and Attribute Data Clouds

arXiv.org Machine Learning

Cluster analysis of very high dimensional data can benefit from the properties of such high dimensionality. Informally expressed, in this work, our focus is on the analogous situation when the dimensionality is moderate to small, relative to a massively sized set of observations. Mathematically expressed, these are the dual spaces of observations and attributes. The point cloud of observations is in attribute space, and the point cloud of attributes is in observation space. In this paper, we begin by summarizing various perspectives related to methodologies that are used in multivariate analytics. We draw on these to establish an efficient clustering processing pipeline for both partitioning and hierarchical clustering.