Clustering is a class of machine learning algorithms that seeks to partition data into groups, or clusters, of similar observations. Although it is an unsupervised technique, in that it does not predict a target variable, applying it to data that you hypothesize contains natural groupings yields clusters of related observations. In that sense, the effect of a clustering algorithm can resemble that of a classification algorithm (a supervised method), even though no labels are used. There are of course many clustering algorithms; one of the most widely used is the K-means algorithm, described below.
Clustering is the separation of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects in other groups. In this paper, the K-means algorithm is implemented with three distance functions in order to identify the optimal distance function for clustering. The proposed K-means algorithm is compared with K-means, Static Weighted K-Means (SWK-Means), and Dynamic Weighted K-Means (DWK-Means) using the Davies-Bouldin index, execution time, and iteration count. Experimental results show that the proposed K-means algorithm performs better on the Iris and Wine datasets than the other three clustering methods.
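The paper above does not name the three distance functions it evaluates, but the choice of metric only changes the assignment step of K-means. The sketch below, using three commonly compared metrics (Euclidean, Manhattan, and Chebyshev) as an assumption, shows how a single assignment step differs across them:

```python
import numpy as np

def assign(points, centroids, metric="euclidean"):
    """Assign each point to the index of its nearest centroid
    under the given distance function."""
    diff = points[:, None, :] - centroids[None, :, :]   # shape (n, k, d)
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=2))
    elif metric == "manhattan":
        dist = np.abs(diff).sum(axis=2)
    elif metric == "chebyshev":
        dist = np.abs(diff).max(axis=2)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return dist.argmin(axis=1)

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
for m in ("euclidean", "manhattan", "chebyshev"):
    print(m, assign(points, centroids, m))
```

On well-separated data like this toy example all three metrics agree; differences emerge on datasets with skewed scales or correlated features, which is what the comparison in the paper measures.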
Clustering is a technique in machine learning that attempts to find clusters of observations within a dataset. The goal is to find clusters such that the observations within each cluster are quite similar to each other, while observations in different clusters are quite different from each other. Clustering is a form of unsupervised learning because we're simply attempting to find structure within a dataset rather than predicting the value of some response variable. In marketing, for example, clustering can be used to identify households that are similar and may be more likely to purchase certain products or respond better to a certain type of advertising. One of the most common forms of clustering is known as k-means clustering.
To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which serve as the starting points for the clusters, and then performs iterative (repetitive) calculations to optimize the positions of the centroids. You define a target number k, which is the number of centroids you need in the dataset; a centroid is the imaginary or real location representing the center of a cluster. Each data point is allocated to exactly one cluster so as to reduce the within-cluster sum of squares. In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest centroid while keeping the clusters as compact as possible.
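The iterative procedure described above (random initialization, assignment to the nearest centroid, centroid update, repeat until convergence) can be sketched in a few lines. This is a minimal illustration, not a production implementation; real libraries add multiple restarts and smarter initialization:

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Minimal K-means: random centroids, then alternate assignment
    and centroid updates until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean).
        dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it;
        # keep an empty cluster's centroid where it is.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

points = np.array([[0.0, 0.0], [0.2, 0.1], [9.0, 9.0], [9.1, 8.8]])
labels, centroids = k_means(points, k=2)
print(labels)
```

Note that the result depends on the random initialization, which is why practical implementations run the loop from several random starts and keep the solution with the lowest within-cluster sum of squares.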
In many situations where the interest lies in identifying clusters, one might expect that not all available variables carry information about these groups. Furthermore, data quality (e.g. outliers or missing entries) can present a serious and sometimes hard-to-assess problem for large and complex datasets. In this paper we show that a small proportion of atypical observations can have serious adverse effects on the solutions found by the sparse clustering algorithm of Witten and Tibshirani (2010). We propose a robustification of their sparse K-means algorithm based on the trimmed K-means algorithm of Cuesta-Albertos et al. (1997). Our proposal is also able to handle datasets with missing values. We illustrate the use of our method on microarray data for cancer patients, where we are able to identify strong biological clusters with a much reduced number of genes. Our simulation studies show that, when there are outliers in the data, our robust sparse K-means algorithm performs better than other competing methods both in terms of the selection of features and the identified clusters. This robust sparse K-means algorithm is implemented in the R package RSKC, which is publicly available from the CRAN repository.
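The trimming idea of Cuesta-Albertos et al. (1997) that the robustification above builds on can be illustrated with a short sketch: in each iteration a fixed proportion alpha of the points farthest from their nearest centroid is discarded before the centroids are updated, so a few outliers cannot drag the cluster centers. This is only a simplified illustration of the trimming step (with a deterministic first-k initialization for reproducibility), not the authors' sparse algorithm:

```python
import numpy as np

def trimmed_k_means(points, k, alpha=0.1, n_iter=50):
    """Trimmed K-means sketch: drop the alpha fraction of points
    farthest from their nearest centroid before each centroid update."""
    centroids = points[:k].astype(float)  # simplified deterministic init
    n_keep = int(np.ceil((1 - alpha) * len(points)))
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        nearest = dist.min(axis=1)
        keep = np.argsort(nearest)[:n_keep]   # indices of untrimmed points
        for j in range(k):
            members = keep[labels[keep] == j]
            if len(members):
                centroids[j] = points[members].mean(axis=0)
    return labels, centroids

# Two tight clusters plus one gross outlier; alpha=0.15 trims one point.
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [9.0, 9.0],
                   [9.2, 8.9], [8.8, 9.1], [100.0, 100.0]])
labels, centroids = trimmed_k_means(points, k=2, alpha=0.15)
print(labels)       # the outlier is still labeled, but never moves a centroid
```

Because the outlier at (100, 100) is always the farthest point from its nearest centroid, it is trimmed in every iteration and the two centroids settle on the means of the genuine clusters.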