Top 5 Machine Learning Algorithms used by Data Scientists with Python: Part-1


Machine learning is an important artificial intelligence technique that can perform a task effectively by learning from experience. According to Forbes, machine learning will replace 25% of jobs within the next 10 years. One of the most popular real-world applications of machine learning is classification, a task that occurs commonly in everyday life. For example, a hospital may want to classify patients as being at high, medium, or low risk of acquiring a certain illness; an opinion polling company may wish to classify the people it interviews as likely to vote for one of several political parties or as undecided; or we may wish to classify a student project as distinction, merit, pass, or fail.
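The hospital example above can be sketched as a three-class classifier. This is a minimal illustration using scikit-learn on synthetic data; the features (age, BMI), thresholds, and labeling rule are invented for the sketch, not taken from any real dataset.

```python
# Hypothetical sketch: classify patients into low/medium/high risk (0/1/2).
# All feature names, coefficients, and thresholds are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.normal(50, 15, n),   # age (illustrative)
    rng.normal(25, 5, n),    # BMI (illustrative)
])
# Derive risk labels from a simple noisy linear score (purely synthetic)
score = 0.03 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.2, n)
y = np.digitize(score, [2.3, 2.9])  # 0 = low, 1 = medium, 2 = high risk

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

A shallow decision tree is used here only because its splits are easy to inspect; any multi-class classifier would fit the same pattern.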

Evaluating and Validating Cluster Results Machine Learning

Clustering is a technique that partitions data according to its characteristics: data that are similar in nature belong to the same cluster [1]. There are two types of methods for evaluating clustering quality. One is external evaluation, where the true labels of the data set are known in advance; the other is internal evaluation, where the evaluation is done on the data set itself, without true labels. In this paper, both external and internal evaluation are performed on clustering results for the Iris dataset. For external evaluation, the homogeneity, completeness, and V-measure scores are calculated (V-measure is the harmonic mean of homogeneity and completeness). For internal evaluation, the silhouette index and the sum of squared errors (SSE) are used. These internal measures, along with the dendrogram (a graphical tool from hierarchical clustering), are first used to validate the number of clusters. Finally, as a statistical tool, we use the frequency-distribution method to compare and visualize the distribution of observations within a clustering result and in the original data.
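The external and internal evaluations described above can be sketched with scikit-learn on the Iris dataset. K-Means with k=3 is an assumption here (chosen to match the three Iris species); the paper's own clustering method may differ.

```python
# Sketch: external (label-based) and internal (label-free) cluster evaluation
# on Iris, using K-Means with k=3 as an assumed clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (completeness_score, homogeneity_score,
                             silhouette_score, v_measure_score)

X, y_true = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

# External evaluation: compare cluster labels against the known species labels
print(f"homogeneity:  {homogeneity_score(y_true, labels):.3f}")
print(f"completeness: {completeness_score(y_true, labels):.3f}")
print(f"v-measure:    {v_measure_score(y_true, labels):.3f}")

# Internal evaluation: silhouette index and sum of squared errors (inertia)
print(f"silhouette:    {silhouette_score(X, labels):.3f}")
print(f"SSE (inertia): {km.inertia_:.1f}")
```

Note that the external scores need the true labels `y_true`, while the internal scores use only the data and the cluster assignments.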

Stop using the Elbow Method


A common challenge we face when performing clustering with K-Means is finding the optimal number of clusters. Naturally, the celebrated and popular Elbow Method is the technique most data scientists use to solve this particular problem. In this post, we are going to learn a more precise and less subjective approach: silhouette score analysis. In another post, I provide a thorough explanation of the K-Means algorithm, its subtleties (centroid initialization, data standardization, and the number of clusters), and some of its pros and cons. There, I also explain when and how to use the Elbow Method.
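Silhouette score analysis can be sketched as follows: fit K-Means for a range of k values and pick the k with the highest average silhouette score. The synthetic data (four well-separated blobs at assumed center positions) is only for illustration.

```python
# Sketch of silhouette-score analysis for choosing k.
# The blob centers and cluster_std are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]  # assumed well-separated blobs
X, _ = make_blobs(n_samples=500, centers=centers,
                  cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
```

Unlike eyeballing an elbow in an inertia curve, the silhouette score gives a single number per k, so "pick the maximum" is an objective rule.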

Clustering Metrics Better Than the Elbow Method - KDnuggets


Clustering is an important part of the machine learning pipeline for business or scientific enterprises utilizing data science. As the name suggests, it helps to identify congregations of closely related (by some measure of distance) data points in a blob of data, which would otherwise be difficult to make sense of. For the most part, however, clustering falls under the realm of unsupervised machine learning, and unsupervised ML is a messy business. There are no known answers or labels to guide the optimization process or to measure our success against.

K-Means Clustering: Techniques to Find the Optimal Clusters


Because the points are uniformly distributed, the K-Means algorithm evenly splits them into K clusters even though there is no real separation between them. The Gap Statistic gives the optimal number of clusters as 10, based on the maximum gap between the cluster inertia on the data and on null reference data.
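The gap statistic described above can be sketched roughly as follows: for each k, compare the log inertia of K-Means on the data against its expected log inertia on uniform "null reference" data drawn over the data's bounding box, and pick the k with the largest gap. The number of reference draws (B=10), the blob centers, and the k range are assumptions for this sketch; the full method of Tibshirani et al. also uses a standard-error correction not shown here.

```python
# Rough sketch of the gap statistic: gap(k) = E[log W_ref(k)] - log W_data(k),
# where W is the K-Means inertia and the reference data is uniform over the
# data's bounding box. B, centers, and k_max are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap_statistic(X, k_max=8, B=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = {}
    for k in range(1, k_max + 1):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        # Inertia of K-Means on B uniform null-reference draws
        ref = [KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_
               for _ in range(B)]
        gaps[k] = np.mean(np.log(ref)) - np.log(inertia)
    return gaps

# Three well-separated synthetic clusters (assumed centers, for illustration)
X, _ = make_blobs(n_samples=300, centers=[(-6, -6), (0, 6), (6, -6)],
                  cluster_std=1.0, random_state=0)
gaps = gap_statistic(X)
print(f"k with largest gap: {max(gaps, key=gaps.get)}")
```

On data with genuine cluster structure the gap peaks at the true k; on uniformly distributed data, as the paragraph above notes, the maximum-gap rule can still report a spurious "optimal" k.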