Goto

Collaborating Authors

 Clustering


Tensor Clustering with Planted Structures: Statistical Optimality and Computational Limits

arXiv.org Machine Learning

This paper studies the statistical and computational limits of high-order clustering with planted structures. We focus on two clustering models, constant high-order clustering (CHC) and rank-one higher-order clustering (ROHC), and study the methods and theory for testing whether a cluster exists (detection) and identifying the support of cluster (recovery). Specifically, we identify the sharp boundaries of signal-to-noise ratio for which CHC and ROHC detection/recovery are statistically possible. We also develop the tight computational thresholds: when the signal-to-noise ratio is below these thresholds, we prove that polynomial-time algorithms cannot solve these problems under the computational hardness conjectures of hypergraphic planted clique (HPC) detection and hypergraphic planted dense subgraph (HPDS) recovery. We also propose polynomial-time tensor algorithms that achieve reliable detection and recovery when the signal-to-noise ratio is above these thresholds. Both sparsity and tensor structures yield the computational barriers in high-order tensor clustering. The interplay between them results in significant differences between high-order tensor clustering and matrix clustering in literature in aspects of statistical and computational phase transition diagrams, algorithmic approaches, hardness conjecture, and proof techniques. To our best knowledge, we are the first to give a thorough characterization of the statistical and computational trade-off for such a double computational-barrier problem. Finally, we provide evidence for the computational hardness conjectures of HPC detection and HPDS recovery.


reval: a Python package to determine the best number of clusters with stability-based relative clustering validation

arXiv.org Machine Learning

Determining the number of clusters that best partitions a dataset can be a challenging task because of 1) the lack of a priori information within an unsupervised learning framework; and 2) the absence of a unique clustering validation approach to evaluate clustering solutions. Here we present reval: a Python package that leverages stability-based relative clustering validation methods to determine best clustering solutions. Statistical software, both in R and Python, usually rely on internal validation metrics, such as the silhouette index, to select the number of clusters that best fits the data. Meanwhile, open-source software solutions that easily implement relative clustering techniques are lacking. Internal validation methods exploit characteristics of the data itself to produce a result, whereas relative approaches attempt to leverage the unknown underlying distribution of data points looking for a replicable and generalizable clustering solution. The implementation of relative validation solutions can further the theory of clustering by enriching the already available methods that can be used to investigate clustering results in different situations and for different data distributions. This work aims at contributing to this effort by developing a stability-based method that selects the best clustering solution as the one that replicates, via supervised learning, on unseen subsets of data. The package works with multiple clustering and classification algorithms, hence allowing further assessment of the stability of different clustering mechanisms.


K Means clustering with python code explained

#artificialintelligence

K means clustering is another simplified algorithm in machine learning. It is categorized into unsupervised learning because here we don't know the result already (no idea about which cluster will be formed). This algorithm is used for vector quantization of the data and has been taken from signal processing methodology. Here the data is divided into several groups, data points in each group have similar characteristics. These clusters are decided by calculating the distance between data points.


Numerical simulation, clustering and prediction of multi-component polymer precipitation

arXiv.org Machine Learning

Multi-component polymer systems are of interest in organic photovoltaic and drug delivery applications, among others where diverse morphologies influence performance. An improved understanding of morphology classification, driven by composition-informed prediction tools, will aid polymer engineering practice. We use a modified Cahn-Hilliard model to simulate polymer precipitation. Such physics-based models require high-performance computations that prevent rapid prototyping and iteration in engineering settings. To reduce the required computational costs, we apply machine learning techniques for clustering and consequent prediction of the simulated polymer blend images in conjunction with simulations. Integrating ML and simulations in such a manner reduces the number of simulations needed to map out the morphology of polymer blends as a function of input parameters and also generates a data set which can be used by others to this end. We explore dimensionality reduction, via principal component analysis and autoencoder techniques, and analyse the resulting morphology clusters. Supervised machine learning using Gaussian process classification was subsequently used to predict morphology clusters according to species molar fraction and interaction parameter inputs. Manual pattern clustering yielded the best results, but machine learning techniques were able to predict the morphology of polymer blends with $\geq$ 90 $\%$ accuracy.


A Three-Stage Algorithm for the Large Scale Dynamic Vehicle Routing Problem with an Industry 4.0 Approach

arXiv.org Artificial Intelligence

Industry 4.0 is a concept which helps companies to have a smart supply chain system when they are faced with a dynamic process. As Industry 4.0 focuses on mobility and real-time integration, it is a good framework for a Dynamic Vehicle Routing problem (DVRP). The main objective of this research is to solve the DVRP on a large-scale size. The aim of this study is to show that the delivery vehicles must serve customer demands from a common depot to have a minimum transit cost without exceeding the capacity constraint of each vehicle. In VRP, to reach an exact solution is quite difficult, and in large-size real world problems it is often impossible. Also, the computational time complexity of this type of problem grows exponentially. In order to find optimal answers for this problem in medium and large dimensions, using a heuristic approach is recommended as the best approach. A hierarchical approach consisting of three stages as cluster-first, route-construction second, route-improvement third is proposed. In the first stage, customers are clustered based on the number of vehicles with different clustering algorithms (i.e., K-mean, GMM, and BIRCH algorithms). In the second stage, the DVRP is solved using construction algorithms and in the third stage improvement algorithms are applied. The second stage is solved using construction algorithms (i.e. Savings algorithm, path cheapest arc algorithm, etc.). In the third stage, improvement algorithms such as Guided Local Search, Simulated Annealing and Tabu Search are applied. One of the main contributions of this paper is that the proposed approach can deal with large-size real world problems to decrease the computational time complexity. The results of this approach confirmed that the proposed methodology is applicable.


How to perform Affinity Propagation with Python in Scikit? โ€“ MachineCurve

#artificialintelligence

Say you've got a dataset where there exist relationships between individual samples, and your goal is to identify groups of related samples within the dataset. Clustering, which is part of the class of unsupervised machine learning algorithms, is then the way to go. But what clustering algorithm to apply when you do not really know the number of clusters? Enter Affinity Propagation, a gossip-style algorithm which derives the number of clusters by mimicing social group formation by passing messages about the popularity of individual samples as to whether they're part of a certain group, or even if they are the leader of one. This algorithm, which can estimate the number of clusters/groups in your dataset itself, is the topic of today's blog post.


Modeling Cell Populations Measured By Flow Cytometry With Covariates Using Sparse Mixture of Regressions

arXiv.org Machine Learning

The ocean is filled with microscopic microalgae called phytoplankton, which together are responsible for as much photosynthesis as all plants on land combined. Our ability to predict their response to the warming ocean relies on understanding how the dynamics of phytoplankton populations is influenced by changes in environmental conditions. One powerful technique to study the dynamics of phytoplankton is flow cytometry, which measures the optical properties of thousands of individual cells per second. Today, oceanographers are able to collect flow cytometry data in real-time onboard a moving ship, providing them with fine-scale resolution of the distribution of phytoplankton across thousands of kilometers. One of the current challenges is to understand how these small and large scale variations relate to environmental conditions, such as nutrient availability, temperature, light and ocean currents. In this paper, we propose a novel sparse mixture of multivariate regressions model to estimate the time-varying phytoplankton subpopulations while simultaneously identifying the specific environmental covariates that are predictive of the observed changes to these subpopulations. We demonstrate the usefulness and interpretability of the approach using both synthetic data and real observations collected on an oceanographic cruise conducted in the north-east Pacific in the spring of 2017.


Navigating the Landscape of Multiplayer Games to Probe the Drosophila of AI

arXiv.org Artificial Intelligence

Multiplayer games have a long history in being used as key testbeds for evaluation and training in artificial intelligence (AI), aptly referred to as the "Drosophila of AI". Traditionally, researchers have focused on using games to build strong AI agents that, e.g., achieve human-level performance. This progress, however, also requires a classification of how 'interesting' a game is for an artificial agent, which requires characterization of games and their topological landscape. Tackling this latter question not only facilitates an understanding of the characteristics of learnt AI agents in games, but can also help determine what game an AI should address next as part of its training. Here, we show how network measures applied to so-called response graphs of large-scale games enable the creation of a useful landscape of games, quantifying the relationships between games of widely varying sizes, characteristics, and complexities. We illustrate our findings in various domains, ranging from well-studied canonical games to significantly more complex empirical games capturing the performance of trained AI agents pitted against one another. Our results culminate in a demonstration of how one can leverage this information to automatically generate new and interesting games, including mixtures of empirical games synthesized from real world games.


On sampling from data with duplicate records

arXiv.org Machine Learning

Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that samples uniformly from the set of entities present in the database in the presence of duplicates. We accomplish this by a two-stage process. In the first step, we estimate the frequencies of all the entities in the database. In the second step, we use rejection sampling to obtain a (approximately) uniform sample from the set of entities. However, efficiently estimating the frequency of all the entities is a non-trivial task and not attainable in the general case. Hence, we consider various natural properties of the data under which such frequency estimation (and consequently uniform sampling) is possible. Under each of those assumptions, we provide sampling algorithms and give proofs of the complexity (both statistical and computational) of our approach. We complement our study by conducting extensive experiments on both real and synthetic datasets.


Multi-view Graph Learning by Joint Modeling of Consistency and Inconsistency

arXiv.org Machine Learning

Graph learning has emerged as a promising technique for multi-view clustering with its ability to learn a unified and robust graph from multiple views. However, existing graph learning methods mostly focus on the multi-view consistency issue, yet often neglect the inconsistency across multiple views, which makes them vulnerable to possibly low-quality or noisy datasets. To overcome this limitation, we propose a new multi-view graph learning framework, which for the first time simultaneously and explicitly models multi-view consistency and multi-view inconsistency in a unified objective function, through which the consistent and inconsistent parts of each single-view graph as well as the unified graph that fuses the consistent parts can be iteratively learned. Though optimizing the objective function is NP-hard, we design a highly efficient optimization algorithm which is able to obtain an approximate solution with linear time complexity in the number of edges in the unified graph. Furthermore, our multi-view graph learning approach can be applied to both similarity graphs and dissimilarity graphs, which lead to two graph fusion-based variants in our framework. Experiments on twelve multi-view datasets have demonstrated the robustness and efficiency of the proposed approach.