Mastering unsupervised learning opens up a broad range of avenues for a data scientist. There is a great deal of scope in unsupervised learning, and yet a lot of beginners in machine learning tend to shy away from it. In fact, I'm sure most newcomers stick to basic clustering algorithms like K-Means clustering and hierarchical clustering. While there's nothing wrong with that approach, it does limit what you can do when faced with clustering projects. And why limit yourself when you can expand your knowledge and skillset by learning the powerful DBSCAN clustering algorithm?

Clustering (cluster analysis) is the task of grouping objects based on their similarities. It is used in many areas, including machine learning, computer graphics, pattern recognition, image analysis, information retrieval, bioinformatics, and data compression. The notion of a cluster is hard to pin down precisely, which is why there are so many different clustering algorithms. Different cluster models are employed, and for each of these cluster models, different algorithms can be given. Clusters found by one clustering algorithm will often differ from clusters found by another. Grouping unlabelled examples is called clustering; because the samples are unlabelled, clustering relies on unsupervised machine learning. If the examples are labelled, the task becomes classification. Knowledge of cluster models is fundamental if you want to understand the differences between various clustering algorithms, and in this article, we're going to explore this topic in depth.

Up to now, we have explored only supervised machine learning algorithms and techniques, which build models from data whose labels are known in advance. In other words, our data had target variables with specific values that we used to train our models. However, when dealing with real-world problems, data will most of the time not come with predefined labels. We therefore want machine learning models that can group this data correctly by finding commonalities in the features on their own, and that can then use those commonalities to predict the classes of new data. In summary, the main goal is to study the intrinsic (and commonly hidden) structure of the data. These techniques can be condensed into two main types of problems that unsupervised learning tries to solve.

Godfrey, Daniel, Johns, Caley, Meyer, Carl, Race, Shaina, Sadek, Carol

Cluster analysis is a field of data analysis that extracts underlying patterns in data. One application of cluster analysis is in text mining, the analysis of large collections of text to find similarities between documents. We used a collection of about 30,000 tweets extracted from Twitter just before the World Cup started. A common problem with real-world text data is the presence of linguistic noise. In our case, this noise consists of extraneous tweets that are unrelated to the dominant themes. To combat this problem, we created an algorithm that combines the DBSCAN algorithm with a consensus matrix, leaving us with only the tweets that are related to those dominant themes. We then used cluster analysis to find the topics that the tweets describe. We clustered the tweets using k-means, a commonly used clustering algorithm, and Non-Negative Matrix Factorization (NMF), and compared the results. The two algorithms gave similar results, but NMF proved to be faster and provided more easily interpreted results. We explored our results using two visualization tools, Gephi and Wordle.
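
To illustrate only the noise-filtering idea (this is not the authors' pipeline, which combined DBSCAN with a consensus matrix on tweet data), here is a minimal from-scratch DBSCAN sketch in Python. The 2-D toy points stand in for document vectors, and the `eps` and `min_pts` values are assumptions chosen for illustration. Points that end up labelled -1 are treated as noise and dropped.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is also a core point
                queue.extend(j_neighbours)
    return labels


# Toy data (made up): two dense groups plus two isolated points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5), (20, 0)]
labels = dbscan(points, eps=1.5, min_pts=3)

# Keep only the points that belong to some cluster.
kept = [p for p, label in zip(points, labels) if label != NOISE]
print(labels)
print(kept)
```

In this toy run the two isolated points are the ones filtered out; on real text data the distance measure and both parameters would need tuning.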

It's a common task for a data scientist: you need to generate segments (or clusters; I'll use the terms interchangeably) of the customer base. Let's start with definitions, of course! Clustering is the subfield of unsupervised learning that aims to partition unlabelled datasets into consistent groups based on some shared unknown characteristics. All the tools you'll need are in Scikit-Learn, so I'll keep the code to a minimum. Instead, through the medium of GIFs, this tutorial will describe the most common techniques. If GIFs aren't your thing (what are you doing on the internet?), you can download this Jupyter notebook here, and the GIFs can be downloaded from this folder (or you can just right click on the GIFs and select 'Save image as…'). Clustering algorithms can be broadly split into two types, depending on whether the number of segments is explicitly specified by the user.
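
As a minimal sketch of that split, assuming scikit-learn and NumPy are installed: `KMeans` needs the number of segments up front, while `DBSCAN` is given a neighbourhood radius and a density threshold and the number of segments falls out of the data. The toy "customer" data and the `eps` and `min_samples` values below are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Toy data (an assumption for illustration): three tight blobs in 2-D.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.2 * rng.standard_normal((30, 2)) for c in centres])

# Type 1: the user specifies the number of segments.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Type 2: no segment count is given; it follows from eps and min_samples.
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
n_found = len(set(dbscan_labels) - {-1})  # scikit-learn marks noise as -1

print("k-means segments:", len(set(kmeans_labels)))
print("DBSCAN segments:", n_found)
```

On well-separated blobs like these, both approaches should recover the same grouping; they tend to diverge once clusters overlap, vary in density, or have irregular shapes.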