 Clustering


R: K-Means Clustering - Deciding how many clusters

#artificialintelligence

In a previous lesson I showed you how to run a K-means cluster analysis in R. You can visit that lesson here: R: K-Means Clustering. In that lesson I chose 3 clusters. I did that because I was the one who made up the data, so I knew 3 clusters would work well. Choosing the right number of clusters is one of the trickier parts of k-means clustering.
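One common way to pick the number of clusters is the "elbow" heuristic: run k-means for several values of k and look for the point where the within-cluster sum of squares (WCSS) stops dropping sharply. The sketch below is in Python rather than R, uses made-up 1-D data with three obvious groups, and replaces the usual random starts with a deterministic farthest-first initialization so the output is repeatable. All names (`farthest_first_centers`, `kmeans_wcss`, `points`) are hypothetical and are not taken from the lesson.

```python
def farthest_first_centers(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min((p - c) ** 2 for c in centers))
        )
    return centers


def kmeans_wcss(points, k, iters=100):
    """Run basic k-means on 1-D points and return the within-cluster
    sum of squares for the final centers."""
    centers = farthest_first_centers(points, k)
    for _ in range(iters):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min((p - c) ** 2 for c in centers) for p in points)


# Made-up data: three tight groups near 0, 10 and 20.
points = [g + d for g in (0.0, 10.0, 20.0) for d in (0.0, 0.5, 1.0, 1.5)]
wcss = {k: kmeans_wcss(points, k) for k in range(1, 6)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

On this toy data the WCSS falls steeply up to k = 3 and only slightly afterwards, which is the "elbow" that suggests 3 clusters.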


Noisy subspace clustering via matching pursuits

arXiv.org Machine Learning

Sparsity-based subspace clustering algorithms have attracted significant attention thanks to their excellent performance in practical applications. A prominent example is the sparse subspace clustering (SSC) algorithm by Elhamifar and Vidal, which performs spectral clustering based on an adjacency matrix obtained by sparsely representing each data point in terms of all the other data points via the Lasso. When the number of data points is large or the dimension of the ambient space is high, the computational complexity of SSC quickly becomes prohibitive. Dyer et al. observed that SSC-OMP obtained by replacing the Lasso by the greedy orthogonal matching pursuit (OMP) algorithm results in significantly lower computational complexity, while often yielding comparable performance. The central goal of this paper is an analytical performance characterization of SSC-OMP for noisy data. Moreover, we introduce and analyze the SSC-MP algorithm, which employs matching pursuit (MP) in lieu of OMP. Both SSC-OMP and SSC-MP are proven to succeed even when the subspaces intersect and when the data points are contaminated by severe noise. The clustering conditions we obtain for SSC-OMP and SSC-MP are similar to those for SSC and for the thresholding-based subspace clustering (TSC) algorithm due to Heckel and Bölcskei. Analytical results in combination with numerical results indicate that both SSC-OMP and SSC-MP with a data-dependent stopping criterion automatically detect the dimensions of the subspaces underlying the data. Moreover, experiments on synthetic and real data show that SSC-MP compares very favorably to SSC, SSC-OMP, TSC, and the nearest subspace neighbor (NSN) algorithm, both in terms of clustering performance and running time. In addition, we find that, in contrast to SSC-OMP, the performance of SSC-MP is very robust with respect to the choice of parameters in the stopping criteria.
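To make the matching pursuit step concrete, here is a minimal Python sketch of generic MP used to represent one data point in terms of the others. It is not the paper's SSC-MP implementation: the toy points, the function names, and the simple residual-norm stopping rule (`tol`, `max_iters`) are assumptions chosen for illustration, and the spectral clustering stage that would follow is omitted.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def matching_pursuit(x, atoms, max_iters=10, tol=1e-9):
    """Greedy matching pursuit: repeatedly pick the unit-norm atom most
    correlated with the residual and subtract its contribution.
    Stopping on a small squared residual norm is an assumption here,
    not the data-dependent criterion analyzed in the paper."""
    coeffs = [0.0] * len(atoms)
    residual = list(x)
    for _ in range(max_iters):
        if dot(residual, residual) <= tol:
            break
        corr = [dot(residual, a) for a in atoms]
        j = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        coeffs[j] += corr[j]
        residual = [r - corr[j] * a for r, a in zip(residual, atoms[j])]
    return coeffs, residual


# Toy data: x lies in the x-y plane; atoms 0 and 1 span that plane,
# atom 2 lies on the z axis (a different subspace).
x = normalize([1.0, 1.0, 0.0])
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
coeffs, residual = matching_pursuit(x, atoms)
print(coeffs, residual)
```

In this toy case the nonzero coefficients fall only on atoms from the same subspace as `x`, which is the property that lets the coefficient magnitudes serve as an adjacency matrix for clustering.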


R: K-Means Clustering

#artificialintelligence

K-Means clustering will be our introduction to Unsupervised Machine Learning. What is Unsupervised Machine Learning exactly? Well, the simplest explanation I can offer is that in supervised learning our data set contains a result, while in unsupervised learning it does not. Think of a simple regression where I have the square footage and selling prices (result) of 100 houses. Taking that data, I can easily create a model that will predict the selling price of a house based on its square footage. Now, take a data set containing 100 houses with the following data: square footage, house style, garage/no garage, but no selling price.
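The supervised case described above can be sketched in a few lines. This example is in Python rather than R, and the four houses and their prices are invented for illustration (they are deliberately perfectly linear so the fit is easy to check by hand). The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented data: square footage and selling price (the "result" column).
sqft = [1000, 1500, 2000, 2500]
price = [150000, 200000, 250000, 300000]

slope, intercept = fit_line(sqft, price)
predicted = slope * 1800 + intercept  # predicted price of an 1800 sq ft house
print(slope, intercept, predicted)
```

Without the `price` column there is nothing to fit against, which is exactly the situation where an unsupervised method such as k-means applies.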


Mega collection of data science books and terminology

@machinelearnbot

A/B Testing - In marketing, A/B testing is a simple randomized experiment with two variants, A and B, which are the control and treatment in the controlled experiment. It is a form of statistical hypothesis testing. Other names include randomized controlled experiments, online controlled experiments, and split testing. In online settings, such as web design (especially user experience design), the goal is to identify changes to web pages that increase or maximize an outcome of interest (e.g., click-through rate for a banner advertisement). Adaptive Boosting (AdaBoost) - AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire who won the prestigious "Gödel Prize" in 2003 for their work.
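As a rough illustration of the hypothesis test behind an A/B experiment, the Python sketch below runs a two-proportion z-test on click-through counts. The counts are invented, the function name is hypothetical, and the pooled z-test is only one of several tests that could be used here.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates,
    using the pooled proportion for the standard error."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: variant A gets 100 clicks from 1000 views, B gets 130 from 1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```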


Statistical and Computational Guarantees of Lloyd's Algorithm and its Variants

arXiv.org Machine Learning

Clustering is a fundamental problem in statistics and machine learning. Lloyd's algorithm, proposed in 1957, is still possibly the most widely used clustering algorithm in practice due to its simplicity and empirical performance. However, there has been little theoretical investigation of the statistical and computational guarantees of Lloyd's algorithm. This paper is an attempt to bridge this gap between practice and theory. We investigate the performance of Lloyd's algorithm on clustering sub-Gaussian mixtures. Under an appropriate initialization for labels or centers, we show that Lloyd's algorithm converges to an exponentially small clustering error after an order of $\log n$ iterations, where $n$ is the sample size. The error rate is shown to be minimax optimal. For the two-mixture case, we only require the initializer to be slightly better than random guessing. In addition, we extend Lloyd's algorithm and its analysis to community detection and crowdsourcing, two problems that have received a lot of attention recently in statistics and machine learning. Two variants of Lloyd's algorithm are proposed, one for community detection and one for crowdsourcing. On the theoretical side, we provide statistical and computational guarantees for the two algorithms, and the results improve upon some previous signal-to-noise ratio conditions in the literature for both problems. Experimental results on simulated and real data sets demonstrate that our algorithms perform competitively with state-of-the-art methods.
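For reference, one iteration of Lloyd's algorithm in its standard textbook form alternates an assignment step and a center update. The notation below ($x_i$ for data points, $z_i$ for labels, $\theta_j$ for centers, $k$ for the number of clusters, $t$ for the iteration index) is an assumption of this note and may differ from the paper's.

```latex
\begin{align*}
\text{Assignment:}\quad z_i^{(t+1)} &= \arg\min_{j \in \{1,\dots,k\}} \bigl\| x_i - \theta_j^{(t)} \bigr\|^2, \\
\text{Update:}\quad \theta_j^{(t+1)} &= \frac{\sum_{i=1}^{n} x_i \,\mathbf{1}\{ z_i^{(t+1)} = j \}}{\sum_{i=1}^{n} \mathbf{1}\{ z_i^{(t+1)} = j \}}.
\end{align*}
```

Starting from initial labels means running the update step first; starting from initial centers means running the assignment step first.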


Using Machine Learning to Measure Job Skill Similarities

@machinelearnbot

This project involved implementing machine learning methodologies to identify similarities in job skills contained in resumes. An organization presented the project to the New York City Data Science Academy to explore whether Academy students might be interested in working on it. The three authors of this post, all students at the Academy at the time, agreed to take the project on. In formulating the analysis described in this post, the authors collaborated with several representatives of the organization. While the organization has asked us to refrain from disclosing its name at this time, the authors wish to convey their gratitude to the organization for the opportunity to work on the project as part of our studies at the Academy. The general idea underlying this project was to uncover semantic similarity and relations behind skills that appear on resumes. A semantic-based approach to evaluating job skill similarity has many potential applications that flow from an understanding of the relationships between skills found in resumes. While there are certainly other approaches to identifying semantic connections between job skills, machine learning techniques create interesting and powerful possibilities.


Spectral Clustering – How Math is Redefining Decision Making

@machinelearnbot

In today's world of big data and the internet of things, it is common for a business to find itself sitting atop a mountain of data. Possessing it is one thing, but leveraging it for data-driven decision making is a very different ball game. Gut feelings and institutionalized heuristics have traditionally been used to guide the development of protocols and decision making, but the world of artificial intelligence and big, disparate data is changing that. Everyone is trying to make sense of, and extract value from, their data. Those that are not will be left behind.


MCMC Louvain for Online Community Detection

arXiv.org Machine Learning

Community detection has become very popular in network analysis over the last decades. Its range of applications includes social sciences, biology and complex systems, such as the world wide web, protein-protein interactions, or social networks (see [5] for a thorough exposition of the topic). To tackle this problem, spectral approaches have been introduced in [12] and [18], inspired by the so-called spectral clustering problem (see [10]). However, the treatment of larger and larger graphs has since been investigated, and modularity-based algorithms have been proposed. This class of algorithms maximizes a quality index called modularity, introduced in [13].
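To show what the modularity index measures, here is a small Python sketch that computes the standard Newman modularity of a given partition. It assumes a simple undirected, unweighted graph with no self-loops; the toy graph and the function name are made up, and this is only the quality index, not the MCMC Louvain algorithm itself.

```python
def modularity(edges, community):
    """Modularity Q = sum over communities c of
    (internal_edges_c / m) - (total_degree_c / (2 m)) ** 2,
    where m is the number of edges."""
    m = len(edges)
    internal = {}      # number of edges with both ends in community c
    total_degree = {}  # sum of node degrees in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        total_degree[cu] = total_degree.get(cu, 0) + 1
        total_degree[cv] = total_degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Toy graph: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

Splitting the graph into its two triangles scores higher than lumping every node into one community, which is the kind of comparison a modularity-maximizing algorithm relies on.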


Semi-supervised Kernel Metric Learning Using Relative Comparisons

arXiv.org Machine Learning

We consider the problem of metric learning subject to a set of constraints on relative-distance comparisons between the data items. Such constraints are meant to reflect side-information that is not expressed directly in the feature vectors of the data items. The relative-distance constraints used in this work are particularly effective in expressing structures at a finer level of detail than must-link (ML) and cannot-link (CL) constraints, which are most commonly used for semi-supervised clustering. Relative-distance constraints are thus useful in settings where providing an ML or a CL constraint is difficult because the granularity of the true clustering is unknown. Our main contribution is an efficient algorithm for learning a kernel matrix using the log determinant divergence, a variant of the Bregman divergence, subject to a set of relative-distance constraints. The learned kernel matrix can then be employed by many different kernel methods in a wide range of applications. In our experimental evaluations, we consider a semi-supervised clustering setting and show empirically that kernels found by our algorithm yield clusterings of higher quality than existing approaches that either use ML/CL constraints or a different means to implement the supervision using relative comparisons.
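A relative-distance constraint can be checked directly from a kernel matrix, because the squared distance induced by a kernel K is d²(i, j) = K[i][i] + K[j][j] - 2 K[i][j]. The Python sketch below assumes constraints are given as triples (a, b, c) meaning "item a should be closer to b than to c"; that triple format, the toy linear kernel, and the function names are illustrative assumptions and may not match the paper's exact formulation.

```python
def kernel_sq_distance(K, i, j):
    """Squared distance between items i and j in the feature space
    implied by the kernel matrix K."""
    return K[i][i] + K[j][j] - 2 * K[i][j]


def check_constraints(K, constraints):
    """For each triple (a, b, c), report whether a is closer to b than to c."""
    return [
        kernel_sq_distance(K, a, b) < kernel_sq_distance(K, a, c)
        for a, b, c in constraints
    ]


# Toy example: three 1-D items with a linear kernel K[i][j] = x_i * x_j.
xs = [0.0, 1.0, 5.0]
K = [[xi * xj for xj in xs] for xi in xs]

constraints = [(0, 1, 2), (2, 0, 1)]
results = check_constraints(K, constraints)
print(results)
```

A metric learning method of this kind would adjust K until as many of the supplied constraints as possible evaluate to true.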


A Randomized Approach to Efficient Kernel Clustering

arXiv.org Machine Learning

ABSTRACT

Kernel-based K-means clustering has gained popularity due to its simplicity and the power of its implicit nonlinear representation of the data. A dominant concern is the memory requirement since memory scales as the square of the number of data points. We provide a new analysis of a class of approximate kernel methods that have more modest memory requirements, and propose a specific one-pass randomized kernel approximation followed by standard K-means on the transformed data. The analysis and experiments suggest the method is accurate, while requiring drastically less memory than standard kernel K-means and significantly less memory than Nyström-based approximations.

Index Terms: Kernel methods, Unsupervised learning, Low-rank approximation, Randomized algorithm

1. INTRODUCTION

Kernel-based approaches are popular methods for supervised and unsupervised learning [1].
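The quadratic memory cost of a full kernel matrix is easy to quantify with some back-of-the-envelope arithmetic. The Python sketch below assumes 8-byte floating-point entries and compares the full n × n matrix with a generic n × r low-rank feature representation; the sizes n = 100,000 and r = 100 are made up, and the comparison is not specific to the randomized method proposed in the paper.

```python
def full_kernel_bytes(n, bytes_per_entry=8):
    """Memory for a dense n x n kernel matrix."""
    return n * n * bytes_per_entry


def low_rank_bytes(n, r, bytes_per_entry=8):
    """Memory for n points represented by r features each."""
    return n * r * bytes_per_entry


n, r = 100_000, 100  # hypothetical problem size and approximation rank
full = full_kernel_bytes(n)
approx = low_rank_bytes(n, r)
print(f"full kernel: {full / 1e9:.0f} GB, rank-{r} features: {approx / 1e6:.0f} MB")
```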