AITopics | Clustering


Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)


Density-Based Clustering Exercises

#artificialintelligence | Jun-11-2017, 19:00:13 GMT

Density-based clustering partitions data into groups with similar characteristics (clusters) without requiring the number of groups to be specified in advance. Clusters are defined as dense regions of data points separated by low-density regions, where density is measured by the number of data points within some radius. There are several density-based methods. The most popular are DBSCAN (density-based spatial clustering of applications with noise), which assumes clusters of roughly constant density; OPTICS (ordering points to identify the clustering structure), which allows for varying density; and mean-shift.
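The density notion in this summary (points counted within a radius, dense regions separated by sparse ones) can be made concrete with a minimal DBSCAN sketch. This is an illustration in plain Python, not the code from the linked exercises; the names `eps`, `min_pts`, `region_query`, and the toy data are assumptions chosen for the example.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may be reclaimed later as a border point
            continue
        cluster += 1                     # points[i] is a core point: start a cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is never passed in: it falls out of `eps` and `min_pts`, and the isolated point is reported as noise rather than forced into a cluster.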

artificial intelligence, density-based clustering exercise, machine learning, (8 more...)

#artificialintelligence

Country: North America > United States > California > Orange County > Irvine (0.06)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)


Using Machine Learning To Generate Human-Readable News Articles

#artificialintelligence | Jun-11-2017, 10:25:16 GMT

TL;DR Abstract - I built ZombieWriter, a Ruby gem that enables users to generate news articles by aggregating paragraphs from other sources. It can use either machine learning algorithms (Latent Semantic Analysis and k-means clustering) or randomization to generate human-readable articles. In this article, I demonstrate how ZombieWriter can use machine learning to create a Markdown file containing 17 human-readable articles. After finishing the demonstration and comparing the output to a randomization process, I explain possible "future research" plans to improve the text generation process. I am not yet ready to claim that this technology is disruptive. Machine learning is hot (to put it mildly). The paradigm of using data instead of code to program machines has been applied to solve a variety of real-world problems.

artificial intelligence, machine learning, paragraph, (15 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.36)


Density-Based Clustering Exercises

#artificialintelligence | Jun-11-2017, 03:35:17 GMT

artificial intelligence, density-based clustering exercise, machine learning, (17 more...)

#artificialintelligence

Country: North America > United States > California > Orange County > Irvine (0.06)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)


Image Compression using K-means Clustering : Colour Quantization

#artificialintelligence | Jun-9-2017, 15:35:22 GMT

This post is a simple yet illustrative application of the K-means clustering technique. Using K-means clustering, we will quantize the colours present in an image, which in turn helps compress it. In a colour image, each pixel takes 3 bytes (RGB), and each colour channel can take intensity values from 0 to 255. By simple combinatorics, the total number of colours that can be represented is 256*256*256 = 16,777,216. In practice, we see only a few of these colours in any one image.
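As a rough sketch of the idea summarized here, the following plain-Python example clusters RGB pixels with k-means and replaces each pixel by an index into a k-colour palette. It is not the post's own code; the function names (`kmeans`, `nearest`, `quantize`, `bits_per_pixel`), the initialization from distinct image colours, and the toy pixel list are assumptions made for illustration.

```python
import random
from math import dist

def nearest(pixel, centroids):
    """Index of the centroid closest to the pixel in RGB space."""
    return min(range(len(centroids)), key=lambda c: dist(pixel, centroids[c]))

def kmeans(pixels, k, iterations=10, seed=0):
    """Lloyd's algorithm on RGB tuples; returns k centroid colours."""
    rng = random.Random(seed)
    # Start from k distinct colours that occur in the image
    # (raises ValueError if the image has fewer than k distinct colours).
    centroids = rng.sample(sorted(set(pixels)), k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            clusters[nearest(p, centroids)].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return centroids

def quantize(pixels, k):
    """Return (palette, indices): k colours plus one palette index per pixel."""
    centroids = kmeans(pixels, k)
    palette = [tuple(round(v) for v in c) for c in centroids]
    indices = [nearest(p, centroids) for p in pixels]
    return palette, indices

def bits_per_pixel(k):
    """Bits needed to store one palette index for a k-colour palette."""
    return max(1, (k - 1).bit_length())

print(256 ** 3)            # 16777216 representable 24-bit colours
print(bits_per_pixel(16))  # 4 bits per pixel instead of 24

# Hypothetical toy "image": a flat list of RGB tuples.
pixels = [(250, 10, 10), (255, 0, 0), (245, 5, 0), (0, 0, 250), (10, 5, 255)]
palette, indices = quantize(pixels, k=2)
print(palette, indices)
```

The saving comes from storing a short index per pixel plus a small palette (k entries of 3 bytes each) instead of 3 bytes per pixel; container and file-format overhead are ignored in this sketch.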

artificial intelligence, compression, machine learning, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)


Clustering with t-SNE, provably

Linderman, George C., Steinerberger, Stefan

arXiv.org Machine Learning | Jun-8-2017

t-distributed Stochastic Neighborhood Embedding (t-SNE), a clustering and visualization method proposed by van der Maaten & Hinton in 2008, has rapidly become a standard tool in a number of natural sciences. Despite its overwhelming success, there is a distinct lack of mathematical foundations and the inner workings of the algorithm are not well understood. The purpose of this paper is to prove that t-SNE is able to recover well-separated clusters; more precisely, we prove that t-SNE in the 'early exaggeration' phase, an optimization technique proposed by van der Maaten & Hinton (2008) and van der Maaten (2014), can be rigorously analyzed. As a byproduct, the proof suggests novel ways for setting the exaggeration parameter $\alpha$ and step size $h$. Numerical examples illustrate the effectiveness of these rules: in particular, the quality of embedding of topological structures (e.g. the swiss roll) improves. We also discuss a connection to spectral clustering methods.
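For orientation, the roles of $\alpha$ and $h$ can be read off the usual t-SNE gradient step. The block below is the standard formulation from van der Maaten & Hinton (2008) with the input affinities $p_{ij}$ scaled by $\alpha$ during early exaggeration; it is not necessarily the paper's exact notation, and the plain gradient-descent update without momentum is a simplifying assumption.

```latex
\frac{\partial C}{\partial y_i}
  = 4 \sum_{j \neq i} \left( \alpha\, p_{ij} - q_{ij} \right)
    \frac{y_i - y_j}{1 + \lVert y_i - y_j \rVert^2},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
\qquad
y_i \leftarrow y_i - h\, \frac{\partial C}{\partial y_i}
```

With $\alpha > 1$ the attractive term $\alpha\, p_{ij}$ dominates the repulsive term $q_{ij}$, so points with high input affinity are pulled together early in the optimization.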

diam 2, early exaggeration phase, spectral method, (12 more...)

arXiv.org Machine Learning

1706.02582

Country: North America > United States > Connecticut > New Haven County > New Haven (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)


Machine Learning Workflows in Python from Scratch Part 2: k-means Clustering

@machinelearnbot | Jun-7-2017, 15:10:06 GMT

In the first part of this series, we started off rather slowly but deliberately. The previous post laid out our goals, and started off with some basic building blocks for our machine learning workflows and pipelines we will eventually get to. If you have not yet read the first installment in this series, I suggest that you do so before moving on. This time around we pick up steam, and will be doing so with an implementation of the k-means clustering algorithm. We will discuss specific aspects of k-means as they come up while coding, but if you are interested in a superficial overview of what the algorithm is about, as well as how it relates to other clustering methods, you could check this out.

artificial intelligence, centroid, machine learning, (14 more...)

@machinelearnbot

Genre: Workflow (0.71)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)


K-means Clustering with R: Call Detail Record Analysis

@machinelearnbot | Jun-6-2017, 18:40:13 GMT

From the above plot, it is evident that clusters 1, 7, and 9 show activity across all 24 hours and generate the most revenue. Clusters 1, 5, 7, 9, and 10 show activity during night hours. Cluster 5 shows activity from 11.5 to 17 hours. Using this clustering mechanism, you can find the clusters that generate the most traffic on the telecom network, measured by total activity. Similarly, you can bring in further information, such as square grid and country code, to identify the square grids likely to generate more revenue and more traffic on the telecom network, and to target high-value customers based on their geolocation. In an upcoming blog post, we will discuss how RFM can be used to analyze call detail records. Bio: Rathnadevi Manivannan is a Senior Technical Writer at Treselle Systems, experienced in and passionate about writing on different technologies and domains such as Big Data, Cloud Computing, Virtualization, Storage, Data Analytics, and Business Analytics.

data mining, information, machine learning, (15 more...)

@machinelearnbot

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.90)


My Data Science Apprenticeship Project

@machinelearnbot | Jun-5-2017, 15:25:04 GMT

Any author would like to know whether his/her article will be successful. Here is an attempt to deal with this task. We crawled 5000 URLs and, for each URL, downloaded the title, the body of the article, and these parameters: number of likes (not including Facebook likes), number of comments, number of views, article creation date, and date of the last comment. First, we got rid of empty (or deleted), very short (fewer than 100 characters long), and "not found" articles, leaving 2000 articles with associated parameters. Then we removed articles with missing parameters and ended up with only 1207 articles. Second, we tokenized the text of every article into words.

artificial intelligence, data mining, machine learning, (15 more...)

@machinelearnbot

Country: North America > United States > California > San Francisco County > San Francisco (0.04)

Technology:

Information Technology > Data Science > Data Mining (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.30)


Ten Steps of EM Suffice for Mixtures of Two Gaussians

Daskalakis, Constantinos, Tzamos, Christos, Zampetakis, Manolis

arXiv.org Machine Learning | Jun-5-2017

The Expectation-Maximization (EM) algorithm is a widely used method for maximum likelihood estimation in models with latent variables. For estimating mixtures of Gaussians, its iteration can be viewed as a soft version of the k-means clustering algorithm. Despite its wide use and applications, there are essentially no known convergence guarantees for this method. We provide global convergence guarantees for mixtures of two Gaussians with known covariance matrices. We show that the population version of EM, where the algorithm is given access to infinitely many samples from the mixture, converges geometrically to the correct mean vectors, and provide simple, closed-form expressions for the convergence rate. As a simple illustration, we show that, in one dimension, ten steps of the EM algorithm initialized at infinity result in less than 1\% error estimation of the means. In the finite sample regime, we show that, under a random initialization, $\tilde{O}(d/\epsilon^2)$ samples suffice to compute the unknown vectors to within $\epsilon$ in Mahalanobis distance, where $d$ is the dimension. In particular, the error rate of the EM based estimator is $\tilde{O}\left(\sqrt{d \over n}\right)$ where $n$ is the number of samples, which is optimal up to logarithmic factors.
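The "soft k-means" view of EM can be illustrated in one dimension. The sketch below assumes a balanced, symmetric mixture 0.5·N(μ, σ²) + 0.5·N(−μ, σ²) with known σ, for which the E-step and M-step collapse into a single update, μ ← mean(x · tanh(μx/σ²)). This parametrization, the sample-based (rather than population) update, the large finite starting value standing in for "initialized at infinity", and all names and constants are assumptions for illustration, not the paper's experiments.

```python
import math
import random

def em_step(xs, mu, sigma=1.0):
    """One EM iteration for 0.5*N(mu, sigma^2) + 0.5*N(-mu, sigma^2).

    The posterior weight of the +mu component is 1 / (1 + exp(-2*mu*x/sigma^2)),
    so the soft label 2*w - 1 equals tanh(mu * x / sigma^2).
    """
    return sum(x * math.tanh(mu * x / sigma ** 2) for x in xs) / len(xs)

def sample_mixture(n, true_mu, sigma=1.0, seed=0):
    """Draw n samples from the balanced two-component mixture."""
    rng = random.Random(seed)
    return [rng.gauss(rng.choice((-true_mu, true_mu)), sigma) for _ in range(n)]

xs = sample_mixture(20000, true_mu=2.0)
mu = 10.0  # deliberately poor, far-away starting point
for step in range(1, 11):
    mu = em_step(xs, mu)
    print(step, round(mu, 4))
```

Replacing `tanh` with the sign function would turn each step into hard assignment, which is the k-means analogue of the same update.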

artificial intelligence, machine learning, tanh, (18 more...)

arXiv.org Machine Learning

1609.00368

Country: North America > United States (0.28)

Genre:

Research Report (0.50)
Workflow (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)


Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders. - PubMed - NCBI

#artificialintelligence | Jun-4-2017, 03:20:23 GMT

Big data bring new opportunities for methods that efficiently summarize and automatically extract knowledge from such compendia. While both supervised learning algorithms and unsupervised clustering algorithms have been successfully applied to biological data, they are either dependent on known biology or limited to discerning the most significant signals in the data. Here we present denoising autoencoders (DAs), which employ a data-defined learning objective independent of known biology, as a method to identify and extract complex patterns from genomic data. We evaluate the performance of DAs by applying them to a large collection of breast cancer gene expression data. Results show that DAs successfully construct features that contain both clinical and molecular information.

artificial intelligence, feature construction and knowledge extraction, machine learning, (8 more...)

#artificialintelligence

Industry:

Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (0.69)
Health & Medicine > Therapeutic Area > Obstetrics/Gynecology (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.62)
