A common challenge when clustering with K-Means is finding the optimal number of clusters. The Elbow method is the technique most data scientists reach for to solve this problem. In this post, we are going to learn a more precise and less subjective approach to finding the optimal number of clusters: silhouette score analysis. In another post, I provide a thorough explanation of the K-Means algorithm, its subtleties (centroid initialization, data standardization, and the number of clusters), and some pros and cons. There, I also explain when and how to use the Elbow Method.

With unsupervised learning, the idea of class attributes and explicit class membership does not exist; in fact, one of the dominant forms of unsupervised learning -- data clustering -- aims to approximate class membership by minimizing interclass instance similarity and maximizing intraclass similarity. This raises the question of how many clusters to look for. We will have a look at two popular methods for attempting to answer this question: the elbow method and the silhouette method. The elbow method plots the within-cluster variance against the number of clusters, so several values of the number of clusters must be tested to draw the curve. The silhouette method measures how similar an object is to its own cluster -- called cohesion -- compared with other clusters -- called separation.
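To make cohesion and separation concrete, here is a minimal sketch of the silhouette coefficient written with the Python standard library only. For each point, `a` is the mean distance to the other members of its own cluster (cohesion), `b` is the mean distance to the nearest other cluster (separation), and the coefficient is `(b - a) / max(a, b)`. The function name `silhouette_values` and the toy data are assumptions made for illustration, and the sketch assumes at least two clusters. In practice you would more likely call `silhouette_score` from scikit-learn.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette coefficient of every point (needs >= 2 clusters)."""
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # usual convention for a singleton cluster
            continue
        # Cohesion: mean distance to the *other* members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Separation: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values


# Hypothetical toy data: two tight, well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

good = sum(silhouette_values(points, good_labels)) / len(points)
bad = sum(silhouette_values(points, bad_labels)) / len(points)
print(f"mean silhouette, good labels: {good:.3f}")
print(f"mean silhouette, bad labels:  {bad:.3f}")
```

A mean value close to 1 suggests compact, well-separated clusters, while values near 0 or below suggest overlapping or misassigned points. Comparing the mean across several candidate numbers of clusters is what the silhouette analysis amounts to.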

The K-Means algorithm seeks to find K clusters in a data set. These clusters should be as far apart from each other as possible while keeping their own elements as close together as possible. Cluster analysis is well suited to finding patterns, segmenting clients, and, in our case, finding similarities. However, the question is always the same: which value of K gives the optimal number of clusters?
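As a rough illustration of what the algorithm does for a fixed K, the sketch below implements the usual assign-then-update loop (Lloyd's algorithm) with the standard library only. The function name `kmeans`, the random initialization from the data points, and the toy data are assumptions for this example, not a description of any particular library's implementation.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-Means: returns (centroids, labels) for a list of coordinate tuples."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical toy data: two tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Note that K is an input here: the algorithm never decides how many clusters there are, which is why a separate criterion such as the Elbow or silhouette method is needed.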

Clustering is a fundamental skill in your Data Science toolkit. It can solve a huge array of problems -- from user segmentation to anomaly detection -- and can help your team derive very interesting insights. Determining the right number of clusters for your project is a little more art than science. In this article, I will go over a few common ways to determine the right number of clusters. The objective of the Elbow method is to find the "Elbow" of the WSS (within-cluster sum of squares) curve in order to determine the smallest number of clusters that captures most of the signal in your data.
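The sketch below shows one way the WSS curve could be produced: run K-Means for several values of K and record the within-cluster sum of squares each time. It uses only the standard library. The helper names `farthest_first` and `kmeans_wss`, the deterministic farthest-first initialization (chosen here only to keep the example reproducible), and the toy data are all assumptions for illustration. With scikit-learn, the same quantity is available as the `inertia_` attribute of a fitted `KMeans` model.

```python
import math


def farthest_first(points, k):
    """Assumed initialization: start at the first point, then repeatedly add
    the point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_wss(points, k, iterations=100):
    """Run a plain K-Means loop and return the within-cluster sum of squares."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # WSS: squared distance from every point to its own centroid, summed.
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Hypothetical toy data: three tight groups of three points each.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 0), (10, 1), (11, 0),
    (5, 9), (5, 10), (6, 9),
]

wss = {k: kmeans_wss(points, k) for k in range(1, 6)}
for k, value in wss.items():
    print(f"k={k}  WSS={value:.2f}")
```

On this toy data the WSS drops sharply up to K = 3 and only marginally afterwards, which is the bend the Elbow method looks for. On real data the bend is often much less pronounced, which is where the subjectivity comes in.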