 Clustering


Density-Based Clustering Exercises

#artificialintelligence

Density-based clustering is a technique that partitions data into groups with similar characteristics (clusters) without requiring the number of groups to be specified in advance. In density-based clustering, clusters are defined as dense regions of data points separated by low-density regions. Density is measured by the number of data points within some radius. There are several methods of density-based clustering. The most popular are DBSCAN (density-based spatial clustering of applications with noise), which assumes constant density of clusters; OPTICS (ordering points to identify the clustering structure), which allows for varying density; and mean-shift.
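
As a rough illustration of the idea that density is the number of points within some radius, here is a minimal DBSCAN-style sketch in pure Python. It is not taken from the exercises themselves: the function names, the toy points, and the parameter values eps and min_pts are assumptions chosen for the example, and the neighbour search is a brute-force scan rather than anything optimized.

```python
import math

def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE          # too sparse here; may become a border point later
            continue
        cluster += 1                   # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point: keep expanding
    return labels

# Toy data (assumed for illustration): two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Note that the number of clusters is never passed in; it falls out of the radius and the minimum neighbour count, and the isolated point is reported as noise.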


Using Machine Learning To Generate Human-Readable News Articles

#artificialintelligence

TL;DR Abstract - I built ZombieWriter, a Ruby gem that will enable users to generate news articles by aggregating paragraphs from other sources. It can use either machine learning algorithms (Latent Semantic Analysis and k-means clustering) or randomization to generate human-readable articles. In this article, I demonstrate how ZombieWriter can use machine learning to create a Markdown file containing 17 human-readable articles. After finishing the demonstration and comparing the output to a randomization process, I then explain possible "future research" plans to improve the text generation process. I am not yet ready to claim that this technology is disruptive. Machine learning is hot (to put it mildly). The paradigm of using data instead of code to program machines has been applied to solve a variety of real-world problems.


Image Compression using K-means Clustering : Colour Quantization

#artificialintelligence

This post is a simple yet illustrative application of the K-means clustering technique. Using K-means clustering, we will quantize the colours present in an image, which in turn helps to compress it. In a coloured image, each pixel occupies 3 bytes (RGB), and each colour channel can take an intensity value from 0 to 255. By simple combinatorics, the total number of colours that can be represented is 256*256*256. In practice, we are able to see only a few of these colours in any one image.
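
To make the arithmetic and the quantization step concrete, here is a small pure-Python sketch. It is not the post's own code: the helper names, the six-pixel toy "image", the seed, and the choice of k = 2 are assumptions for illustration, and a real image would normally be handled with an array library.

```python
import math
import random

def kmeans_colours(pixels, k, iterations=10, seed=0):
    """Plain k-means on RGB tuples; returns k centroid colours (floats)."""
    rng = random.Random(seed)
    # Start from k distinct colours that actually occur in the image.
    centroids = rng.sample(sorted(set(pixels)), k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in pixels:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return centroids

def quantize(pixels, centroids):
    """Replace each pixel by the index of its nearest centroid colour."""
    palette = [tuple(round(v) for v in c) for c in centroids]
    indices = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
               for p in pixels]
    return palette, indices

# Toy "image" (assumed): three reddish and three bluish pixels.
pixels = [(250, 10, 10), (240, 20, 15), (255, 0, 5),
          (10, 10, 250), (20, 15, 240), (0, 5, 255)]
k = 2
palette, indices = quantize(pixels, kmeans_colours(pixels, k))

total_colours = 256 ** 3                   # colours representable with 3 bytes
bits_per_pixel_before = 24                 # 3 bytes of RGB per pixel
bits_per_pixel_after = math.ceil(math.log2(k))  # one palette index per pixel
print(total_colours, palette, indices, bits_per_pixel_after)
```

The saving comes from storing a short palette plus one small index per pixel instead of a full 24-bit colour per pixel; the price is that every pixel is rounded to its nearest palette colour.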


Clustering with t-SNE, provably

arXiv.org Machine Learning

t-distributed Stochastic Neighborhood Embedding (t-SNE), a clustering and visualization method proposed by van der Maaten & Hinton in 2008, has rapidly become a standard tool in a number of natural sciences. Despite its overwhelming success, there is a distinct lack of mathematical foundations and the inner workings of the algorithm are not well understood. The purpose of this paper is to prove that t-SNE is able to recover well-separated clusters; more precisely, we prove that t-SNE in the 'early exaggeration' phase, an optimization technique proposed by van der Maaten & Hinton (2008) and van der Maaten (2014), can be rigorously analyzed. As a byproduct, the proof suggests novel ways for setting the exaggeration parameter $\alpha$ and step size $h$. Numerical examples illustrate the effectiveness of these rules: in particular, the quality of embedding of topological structures (e.g. the swiss roll) improves. We also discuss a connection to spectral clustering methods.


Machine Learning Workflows in Python from Scratch Part 2: k-means Clustering

@machinelearnbot

In the first part of this series, we started off rather slowly but deliberately. The previous post laid out our goals, and started off with some basic building blocks for our machine learning workflows and pipelines we will eventually get to. If you have not yet read the first installment in this series, I suggest that you do so before moving on. This time around we pick up steam, and will be doing so with an implementation of the k-means clustering algorithm. We will discuss specific aspects of k-means as they come up while coding, but if you are interested in a superficial overview of what the algorithm is about, as well as how it relates to other clustering methods, you could check this out.


K-means Clustering with R: Call Detail Record Analysis

@machinelearnbot

From the above plot, it is evident that clusters 1, 7, and 9 have activity across all 24 hours and are the higher revenue-generating clusters. Clusters 1, 5, 7, 9, and 10 have activity during night hours. Cluster 5 has activity from 11.5 to 17 hours. Using this clustering mechanism, you can find the clusters generating more traffic on the telecom network, measured by total activity. Similarly, you can bring in more information, such as square grid and country code, to understand which square grids are likely creating more revenue and more traffic on the telecom network, and to target high-value customers based on their geolocation. In the upcoming blog, we will discuss how RFM will be used to analyze call detail records. Bio: Rathnadevi Manivannan is a Senior Technical Writer at Treselle Systems, experienced in and passionate about writing on different technologies and domains such as Big Data, Cloud Computing, Virtualization, Storage, Data Analytics, and Business Analytics.


My Data Science Apprenticeship Project

@machinelearnbot

Any author would like to know whether his/her article will be successful. Here is an attempt to deal with this task. We crawled 5000 URLs and for each URL we downloaded the title, the body of the article and these parameters: number of likes (not including Facebook likes), number of comments, number of views, article creation date and date of the last comment. First, we got rid of empty (or deleted), very short (fewer than 100 characters long) and "not found" articles, leaving 2000 articles with associated parameters. Then we removed articles with missing parameters and ended up with only 1207 articles. Second, we tokenized the words of every article.


Ten Steps of EM Suffice for Mixtures of Two Gaussians

arXiv.org Machine Learning

The Expectation-Maximization (EM) algorithm is a widely used method for maximum likelihood estimation in models with latent variables. For estimating mixtures of Gaussians, its iteration can be viewed as a soft version of the k-means clustering algorithm. Despite its wide use and applications, there are essentially no known convergence guarantees for this method. We provide global convergence guarantees for mixtures of two Gaussians with known covariance matrices. We show that the population version of EM, where the algorithm is given access to infinitely many samples from the mixture, converges geometrically to the correct mean vectors, and provide simple, closed-form expressions for the convergence rate. As a simple illustration, we show that, in one dimension, ten steps of the EM algorithm initialized at infinity result in less than 1\% error estimation of the means. In the finite sample regime, we show that, under a random initialization, $\tilde{O}(d/\epsilon^2)$ samples suffice to compute the unknown vectors to within $\epsilon$ in Mahalanobis distance, where $d$ is the dimension. In particular, the error rate of the EM based estimator is $\tilde{O}\left(\sqrt{d \over n}\right)$ where $n$ is the number of samples, which is optimal up to logarithmic factors.
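
The "soft k-means" view of EM mentioned in the abstract can be sketched for the simplest case. The code below is only an illustration under assumptions that are not the paper's exact setting: one dimension, a balanced mixture with known unit variance, a finite random sample, and a finite starting point. All names, the seed, the true means of -2 and 2, and the sample size are hypothetical choices.

```python
import math
import random

def _sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def em_two_gaussians(xs, mu1, mu2, sigma=1.0, steps=10):
    """EM for a balanced mixture of two 1-D Gaussians with known variance.

    Only the two means are estimated; the mixing weights stay at 1/2 each.
    """
    for _ in range(steps):
        # E-step: soft assignment, i.e. the posterior probability that each
        # sample came from component 1 (k-means would use a hard 0/1 here).
        resp = [_sigmoid(((x - mu2) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2))
                for x in xs]
        # M-step: each mean becomes the responsibility-weighted average.
        total1 = sum(resp)
        total2 = len(xs) - total1
        mu1 = sum(r * x for r, x in zip(resp, xs)) / total1
        mu2 = sum((1 - r) * x for r, x in zip(resp, xs)) / total2
    return mu1, mu2

# Synthetic data (assumed): 1000 samples around -2 and 1000 around +2.
rng = random.Random(0)
xs = ([rng.gauss(-2.0, 1.0) for _ in range(1000)]
      + [rng.gauss(2.0, 1.0) for _ in range(1000)])

est1, est2 = em_two_gaussians(xs, mu1=-0.5, mu2=0.5, steps=10)
print(est1, est2)
```

Replacing the soft responsibilities with hard 0/1 assignments to the nearer mean would turn each iteration into a k-means step, which is the correspondence the abstract refers to.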


Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders. - PubMed - NCBI

#artificialintelligence

Big data bring new opportunities for methods that efficiently summarize and automatically extract knowledge from such compendia. While both supervised learning algorithms and unsupervised clustering algorithms have been successfully applied to biological data, they are either dependent on known biology or limited to discerning the most significant signals in the data. Here we present denoising autoencoders (DAs), which employ a data-defined learning objective independent of known biology, as a method to identify and extract complex patterns from genomic data. We evaluate the performance of DAs by applying them to a large collection of breast cancer gene expression data. Results show that DAs successfully construct features that contain both clinical and molecular information.