Goto

Collaborating Authors

 Clustering


Preconditioned Spectral Clustering for Stochastic Block Partition Streaming Graph Challenge

arXiv.org Machine Learning

Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) is demonstrated to efficiently solve eigenvalue problems for graph Laplacians that appear in spectral clustering. For static graph partitioning, 10-20 iterations of LOBPCG without preconditioning result in ~10x error reduction, enough to achieve 100% correctness for all Challenge datasets with known truth partitions, e.g., for graphs with 5K/.1M (50K/1M) Vertices/Edges in 2 (7) seconds, compared to over 5,000 (30,000) seconds needed by the baseline Python code. Our Python code 100% correctly determines 98 (160) clusters from the Challenge static graphs with 0.5M (2M) vertices in 270 (1,700) seconds using 10GB (50GB) of memory. Our single-precision MATLAB code calculates the same clusters at half time and memory. For streaming graph partitioning, LOBPCG is initiated with approximate eigenvectors of the graph Laplacian already computed for the previous graph, in many cases reducing 2-3 times the number of required LOBPCG iterations, compared to the static case. Our spectral clustering is generic, i.e. assuming nothing specific of the block model or streaming, used to generate the graphs for the Challenge, in contrast to the base code. Nevertheless, in 10-stage streaming comparison with the base code for the 5K graph, the quality of our clusters is similar or better starting at stage 4 (7) for emerging edging (snowballing) streaming, while the computations are over 100-1000 faster.


Clustering and Dimensionality Reduction: Understanding the "Magic" Behind Machine Learning โ€“ Blog Imperva

#artificialintelligence

These days we hear about machine learning and artificial intelligence (AI) in all aspects of life. We see machines that learn and imitate the human brain in order to automate human processes. There are autonomous cars that learn the road conditions to drive, personal assistants we can converse with and machines that can predict what stock markets will do. In some respects, it can appear as "magic." Behind machine learning there are some fundamental, well-studied and understood techniques.


Data-Driven Tree Transforms and Metrics

arXiv.org Machine Learning

We consider the analysis of high dimensional data given in the form of a matrix with columns consisting of observations and rows consisting of features. Often the data is such that the observations do not reside on a regular grid, and the given order of the features is arbitrary and does not convey a notion of locality. Therefore, traditional transforms and metrics cannot be used for data organization and analysis. In this paper, our goal is to organize the data by defining an appropriate representation and metric such that they respect the smoothness and structure underlying the data. We also aim to generalize the joint clustering of observations and features in the case the data does not fall into clear disjoint groups. For this purpose, we propose multiscale data-driven transforms and metrics based on trees. Their construction is implemented in an iterative refinement procedure that exploits the co-dependencies between features and observations. Beyond the organization of a single dataset, our approach enables us to transfer the organization learned from one dataset to another and to integrate several datasets together. We present an application to breast cancer gene expression analysis: learning metrics on the genes to cluster the tumor samples into cancer sub-types and validating the joint organization of both the genes and the samples. We demonstrate that using our approach to combine information from multiple gene expression cohorts, acquired by different profiling technologies, improves the clustering of tumor samples.


Two provably consistent divide and conquer clustering algorithms for large networks

arXiv.org Machine Learning

In this article, we advance divide-and-conquer strategies for solving the community detection problem in networks. We propose two algorithms which perform clustering on a number of small subgraphs and finally patches the results into a single clustering. The main advantage of these algorithms is that they bring down significantly the computational cost of traditional algorithms, including spectral clustering, semi-definite programs, modularity based methods, likelihood based methods etc., without losing on accuracy and even improving accuracy at times. These algorithms are also, by nature, parallelizable. Thus, exploiting the facts that most traditional algorithms are accurate and the corresponding optimization problems are much simpler in small problems, our divide-and-conquer methods provide an omnibus recipe for scaling traditional algorithms up to large networks. We prove consistency of these algorithms under various subgraph selection procedures and perform extensive simulations and real-data analysis to understand the advantages of the divide-and-conquer approach in various settings.


A probabilistic approach to emission-line galaxy classification

arXiv.org Machine Learning

We invoke a Gaussian mixture model (GMM) to jointly analyse two traditional emission-line classification schemes of galaxy ionization sources: the Baldwin-Phillips-Terlevich (BPT) and $\rm W_{H\alpha}$ vs. [NII]/H$\alpha$ (WHAN) diagrams, using spectroscopic data from the Sloan Digital Sky Survey Data Release 7 and SEAGal/STARLIGHT datasets. We apply a GMM to empirically define classes of galaxies in a three-dimensional space spanned by the $\log$ [OIII]/H$\beta$, $\log$ [NII]/H$\alpha$, and $\log$ EW(H${\alpha}$), optical parameters. The best-fit GMM based on several statistical criteria suggests a solution around four Gaussian components (GCs), which are capable to explain up to 97 per cent of the data variance. Using elements of information theory, we compare each GC to their respective astronomical counterpart. GC1 and GC4 are associated with star-forming galaxies, suggesting the need to define a new starburst subgroup. GC2 is associated with BPT's Active Galaxy Nuclei (AGN) class and WHAN's weak AGN class. GC3 is associated with BPT's composite class and WHAN's strong AGN class. Conversely, there is no statistical evidence -- based on four GCs -- for the existence of a Seyfert/LINER dichotomy in our sample. Notwithstanding, the inclusion of an additional GC5 unravels it. The GC5 appears associated to the LINER and Passive galaxies on the BPT and WHAN diagrams respectively. Subtleties aside, we demonstrate the potential of our methodology to recover/unravel different objects inside the wilderness of astronomical datasets, without lacking the ability to convey physically interpretable results. The probabilistic classifications from the GMM analysis are publicly available within the COINtoolbox (https://cointoolbox.github.io/GMM\_Catalogue/).


Identification of individual coherent sets associated with flow trajectories using Coherent Structure Coloring

arXiv.org Machine Learning

We present a method for identifying the coherent structures associated with individual Lagrangian flow trajectories even where only sparse particle trajectory data is available. The method, based on techniques in spectral graph theory, uses the Coherent Structure Coloring vector and associated eigenvectors to analyze the distance in higher-dimensional eigenspace between a selected reference trajectory and other tracer trajectories in the flow. By analyzing this distance metric in a hierarchical clustering, the coherent structure of which the reference particle is a member can be identified. This algorithm is proven successful in identifying coherent structures of varying complexities in canonical unsteady flows. Additionally, the method is able to assess the relative coherence of the associated structure in comparison to the surrounding flow. Although the method is demonstrated here in the context of fluid flow kinematics, the generality of the approach allows for its potential application to other unsupervised clustering problems in dynamical systems such as neuronal activity, gene expression, or social networks.


Adaptive Clustering Using Kernel Density Estimators

arXiv.org Machine Learning

We investigate statistical properties of a clustering algorithm that receives level set estimates from a kernel density estimator and then estimates the first split in the density level cluster tree if such a split is present or detects the absence of such a split. Key aspects of our analysis include finite sample guarantees, consistency, rates of convergence, and an adaptive data-driven strategy for chosing the kernel bandwidth. For the rates and the adaptivity we do not need continuity assumptions on the density such as H\"older continuity, but only require intuitive geometric assumptions of non-parametric nature.


Consistency of Dirichlet Partitions

arXiv.org Machine Learning

A Dirichlet $k$-partition of a domain $U \subseteq \mathbb{R}^d$ is a collection of $k$ pairwise disjoint open subsets such that the sum of their first Laplace-Dirichlet eigenvalues is minimal. A discrete version of Dirichlet partitions has been posed on graphs with applications in data analysis. Both versions admit variational formulations: solutions are characterized by minimizers of the Dirichlet energy of mappings from $U$ into a singular space $\Sigma_k \subseteq \mathbb{R}^k$. In this paper, we extend results of N.\ Garc\'ia Trillos and D.\ Slep\v{c}ev to show that there exist solutions of the continuum problem arising as limits to solutions of a sequence of discrete problems. Specifically, a sequence of points $\{x_i\}_{i \in \mathbb{N}}$ from $U$ is sampled i.i.d.\ with respect to a given probability measure $\nu$ on $U$ and for all $n \in \mathbb{N}$, a geometric graph $G_n$ is constructed from the first $n$ points $x_1, x_2, \ldots, x_n$ and the pairwise distances between the points. With probability one with respect to the choice of points $\{x_i\}_{i \in \mathbb{N}}$, we show that as $n \to \infty$ the discrete Dirichlet energies for functions $G_n \to \Sigma_k$ $\Gamma$-converge to (a scalar multiple of) the continuum Dirichlet energy for functions $U \to \Sigma_k$ with respect to a metric coming from the theory of optimal transport. This, along with a compactness property for the aforementioned energies that we prove, implies the convergence of minimizers. When $\nu$ is the uniform distribution, our results also imply the statistical consistency statement that Dirichlet partitions of geometric graphs converge to partitions of the sampled space in the Hausdorff sense.


Towards life cycle identification of malaria parasites using machine learning and Riemannian geometry

arXiv.org Machine Learning

Malaria is a serious infectious disease that is responsible for over half million deaths yearly worldwide. The major cause of these mortalities is late or inaccurate diagnosis. Manual microscopy is currently considered as the dominant diagnostic method for malaria. However, it is time consuming and prone to human errors. The aim of this paper is to automate the diagnosis process and minimize the human intervention. We have developed the hardware and software for a cost-efficient malaria diagnostic system. This paper describes the manufactured hardware and also proposes novel software to handle parasite detection and life-stage identification. A motorized microscope is developed to take images from Giemsa-stained blood smears. A patch-based unsupervised statistical clustering algorithm is proposed which offers a novel method for classification of different regions within blood images. The proposed method provides better robustness against different imaging settings. The core of the proposed algorithm is a model called Mixture of Independent Component Analysis. A manifold based optimization method is proposed that facilitates the application of the model for high dimensional data usually acquired in medical microscopy. The method was tested on 600 blood slides with various imaging conditions. The speed of the method is higher than current supervised systems while its accuracy is comparable to or better than them.


Nice Generalization of the K-NN Clustering Algorithm -- Also Useful for Data Reduction

@machinelearnbot

You don't need to know K-NN to understand this article -- but click here if you want to learn more about it. You don't need a background in statistical science either. Let's describe this new algorithm and its various components, in simple English We are dealing here with a supervised learning problem, and more specifically, clustering (also called supervised classification.). In particular, we want to assign a class label to a new observation that does not belong to the training set. Instead of checking out individual points (the nearest neighbors) and using a majority (voting) rule to assign the new observation to a cluster based on nearest neighbor counts, we are checking out cliques of points, and focus on the nearest cliques rather than on the nearest points.