 Clustering


Practical Coreset Constructions for Machine Learning

arXiv.org Machine Learning

Over the last few years, the world has witnessed the emergence of data sets of unprecedented size across different scientific disciplines. The large volume of such data sets presents new challenges, as gathering, storing, and analyzing them become expensive. In the context of millions or even billions of data points, existing proven algorithms "suddenly" become computationally infeasible, while data sets may no longer fit on a single machine and must instead be stored on clusters of machines. As a consequence, new algorithms are required to scale to this massive data setting. While one could focus on single machine learning problems and come up with endless new algorithms, we focus on a more general approach: we investigate coresets -- succinct, small summaries of large data sets -- so that solutions found on the summary are provably competitive with solutions found on the full data set.
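To make the idea of a weighted summary concrete, here is a minimal sketch of one known sampling-based construction (often called a "lightweight coreset" for k-means). It is an illustration under assumptions, not necessarily a construction from the paper above: the function name `lightweight_coreset` and its parameters are hypothetical. Each point is sampled with probability that mixes a uniform term with its squared distance to the data mean, and sampled points receive the weight 1 / (m * q) so that weighted sums over the summary estimate sums over the full data set.

```python
import random


def lightweight_coreset(points, m, seed=0):
    """Sample m weighted points from `points` (a list of equal-length tuples).

    Hypothetical helper for illustration only.
    Returns a list of (point, weight) pairs.
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])

    # Mean of the full data set.
    mean = [sum(p[j] for p in points) / n for j in range(dim)]

    # Squared distance of every point to the mean.
    sq = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    total = sum(sq)

    # Sampling distribution: half uniform, half proportional to squared distance.
    if total == 0:
        q = [1.0 / n] * n
    else:
        q = [0.5 / n + 0.5 * s / total for s in sq]

    # Draw m indices with replacement and attach importance weights.
    idx = rng.choices(range(n), weights=q, k=m)
    return [(points[i], 1.0 / (m * q[i])) for i in idx]


if __name__ == "__main__":
    data = [(float(i), float(i % 7)) for i in range(1000)]
    summary = lightweight_coreset(data, m=50)
    print(len(summary), "weighted points summarize", len(data), "points")
```

A weighted clustering algorithm would then be run on the (point, weight) pairs instead of the full data set.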


Anti-Money Laundering and AI at HSBC Ayasdi

#artificialintelligence

HSBC and Ayasdi used Topological Data Analysis (TDA) and machine learning (ML) to automatically assemble self-similar groups of customers and customers-of-customers. This exercise was done entirely unsupervised, with Ayasdi's software making the selection of the appropriate algorithms, creating candidate groups and tuning the scenario thresholds within those groups until the optimal ones were identified. In this case, the platform automatically normalized the data columns and combined multi-dimensional scaling and single linkage clustering algorithms to create the topological model. This was then passed through an agglomerative hierarchical clustering algorithm which was optimized to produce balanced segments.


Rank-One NMF-Based Initialization for NMF and Relative Error Bounds under a Geometric Assumption

arXiv.org Machine Learning

We propose a geometric assumption on nonnegative data matrices such that under this assumption, we are able to provide upper bounds (both deterministic and probabilistic) on the relative error of nonnegative matrix factorization (NMF). The algorithm we propose first uses the geometric assumption to obtain an exact clustering of the columns of the data matrix; subsequently, it employs several rank-one NMFs to obtain the final decomposition. When applied to data matrices generated from our statistical model, we observe that our proposed algorithm produces factor matrices with comparable relative errors vis-à-vis classical NMF algorithms but with much faster speeds. On face image and hyperspectral imaging datasets, we demonstrate that our algorithm provides an excellent initialization for applying other NMF algorithms at a low computational cost. Finally, we show on face and text datasets that the combinations of our algorithm and several classical NMF algorithms outperform other algorithms in terms of clustering performance.


Alternatives to algebraic modeling for complex data: topological modeling via Gunnar Carlsson

@machinelearnbot

For many, mathematical modeling is exclusively about algebraic models, based on one form or another of regression or on differential equation modeling in the case of dynamical systems. However, this is too restrictive a point of view. For example, a clustering algorithm can be regarded as a modeling mechanism applicable to data where linear regression simply isn't applicable. Hierarchical clustering can also be regarded as a modeling mechanism, where the output is a dendrogram and contains information about the behavior of clusters at different levels of resolution. Kohonen self-organizing maps can similarly be regarded in this way.
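As a small illustration of hierarchical clustering viewed as a model, the sketch below runs naive single-linkage agglomerative clustering and returns the merge history, which is the information a dendrogram draws. The function name `single_linkage` is a hypothetical helper, and the brute-force search is chosen for readability, not speed.

```python
import math


def single_linkage(points):
    """Naive single-linkage agglomerative clustering.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are tuples of point indices. Hypothetical helper.
    """
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Start with every point in its own cluster.
    clusters = [(i,) for i in range(len(points))]
    merges = []

    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest closest-pair distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    dist(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)

        d, i, j = best
        merges.append((clusters[i], clusters[j], d))

        # Replace the two merged clusters with their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return merges


if __name__ == "__main__":
    for a, b, d in single_linkage([(0.0,), (1.0,), (10.0,), (11.5,)]):
        print(a, "+", b, "at distance", d)
```

Reading the merge distances from first to last shows how clusters behave at different levels of resolution: small distances join tight groups, and a large jump marks well-separated groups.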


Direct Mapping Hidden Excited State Interaction Patterns from ab initio Dynamics and Its Implications on Force Field Development

arXiv.org Machine Learning

The excited states of polyatomic systems are rather complex and often exhibit meta-stable dynamical behaviors. Static analysis of reaction pathways often fails to sufficiently characterize excited state motions due to their highly non-equilibrium nature. Here, we propose a time-series-guided clustering algorithm to generate the most relevant meta-stable patterns directly from ab initio dynamics trajectories. Based on the knowledge of these meta-stable patterns, we suggest an interpolation scheme that uses only a concrete and finite set of known patterns to accurately predict the ground and excited state properties of the entire dynamics trajectories. As illustrated with the example of sinapic acids, the estimation errors for the ground and excited states are very close, which indicates that one could predict the ground and excited state molecular properties with similar accuracy. These results may provide some insight into constructing an excited state force field with energy terms compatible with traditional ones.


Introduction to K-means Clustering: A Tutorial

@machinelearnbot

Dr. Andrea Trevino presents a beginner introduction to the widely-used K-means clustering algorithm in this tutorial. K-means clustering is a type of unsupervised learning, which is used when the resulting categories or groups in the data are unknown. This algorithm finds the groups that exist organically in the data and the results allow the user to label new data quickly. Clustering, in general, is a key tool for understanding your data. This algorithm can be used in a number of applications, including behavioral segmentation, inventory categorization, sorting sensor measurements, and detecting bots or anomalies, to name a few. This tutorial covers the iterative algorithm that determines the clusters and works through a delivery fleet data example in Python.
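The iterative algorithm mentioned above alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The following is a minimal pure-Python sketch of that loop, not the tutorial's own code; the function name `kmeans`, its parameters, and the toy data are assumptions made for illustration.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Basic k-means (Lloyd's iteration) on a list of equal-length tuples.

    Hypothetical helper. Returns (centroids, labels).
    """
    rng = random.Random(seed)

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def assign(centroids):
        # Label each point with the index of its nearest centroid.
        return [
            min(range(k), key=lambda c: sqdist(p, centroids[c]))
            for p in points
        ]

    # Initialize centroids with k distinct data points chosen at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]

    for _ in range(iters):
        labels = assign(centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Move the centroid to the mean of its members.
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids

    return centroids, assign(centroids)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, labels = kmeans(data, k=2)
    print(centroids, labels)
```

The result depends on the random initialization, so in practice the algorithm is usually run several times and the solution with the lowest within-cluster squared distance is kept.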


Real-Time Background Subtraction Using Adaptive Sampling and Cascade of Gaussians

arXiv.org Machine Learning

Background-foreground classification is a fundamental, well-studied problem in computer vision. Due to the pixel-wise nature of modeling and processing in the algorithm, it is usually difficult to satisfy real-time constraints. There is a trade-off between speed (because of model complexity) and accuracy. Inspired by the rejection cascade of the Viola-Jones classifier, we decompose the Gaussian Mixture Model (GMM) into an adaptive cascade of classifiers. This way, we achieve a good improvement in speed without compromising accuracy. In the training phase, we learn multiple KDEs for different durations to be used as a strong prior distribution and detect probable oscillating pixels, which usually result in misclassifications. We propose a confidence measure for the classifier based on temporal consistency and the prior distribution. The confidence measure thus derived is used to adapt the learning rate and the thresholds of the model to improve accuracy. The confidence measure is also employed to perform temporal and spatial sampling in a principled way. We demonstrate a speed-up factor of 5x to 10x and a 17 percent average improvement in accuracy over several standard videos.


Fuzzy Approach Topic Discovery in Health and Medical Corpora

arXiv.org Machine Learning

The majority of medical documents and electronic health records (EHRs) are in text format, which poses a challenge for data processing and finding relevant documents. Looking for ways to automatically retrieve the enormous amount of health and medical knowledge has always been an intriguing topic. Powerful methods have been developed in recent years to automate text processing. One of the popular approaches to retrieving information based on discovering the themes in health & medical corpora is topic modeling; however, this approach still needs new perspectives. In this research, we describe fuzzy latent semantic analysis (FLSA), a novel approach to topic modeling from a fuzzy perspective. FLSA can handle the redundancy issue in health & medical corpora and provides a new method to estimate the number of topics. The quantitative evaluations show that FLSA produces superior performance and features compared to latent Dirichlet allocation (LDA), the most popular topic model.


Identifying the number of clusters: finally a solution

@machinelearnbot

It optimizes the number of clusters when the clustering method is maximizing the variance among the clusters. If you are using, for example, K-means as the clustering algorithm, your method will fail for every number of clusters you try to use! As you can see, the right number of clusters doesn't exist for this problem when using "naive" k-means. BTW, for k-means and density-based clustering algorithms, I've seen methods based on EM (expectation-maximization) and the Bayesian information criterion (BIC) that are a little bit more robust than this method. Could you share the table of the points...just to play a little bit with them:)


Analytics training courses

#artificialintelligence

Includes key concepts of statistical analysis: probability theory, types of distribution, the central limit theorem, hypothesis testing, and statistical inference.