withinss
How to cluster dataset with high dimensionality and mixed datatypes
When it comes to cluster analysis for retail and e-commerce customer data, more often than not you will find the dataset messy, high dimensional, and full of categorical variables. Although there are many dimensionality reduction techniques, most of them do not work well on a dataset with many categorical variables. Traditionally, clustering approaches suffer when features are not clean numeric values. For example, k-means, the most popular clustering algorithm, can only handle numeric variables. Generalized low rank models (GLRMs), developed by students at Stanford University (see Udell '16), propose a new clustering framework that handles all types of data, even mixed datatypes.
R: K-Means Clustering- Deciding how many clusters
In a previous lesson I showed you how to do a K-means cluster analysis in R. You can visit that lesson here: R: K-Means Clustering. In that lesson I chose 3 clusters. I did that because I was the one who made up the data, so I knew 3 clusters would work well. Choosing the right number of clusters is one of the trickier parts of performing a k-means cluster analysis.
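One common way to pick the number of clusters is to run k-means for several values of k and watch how the total within-cluster sum of squares (what R's kmeans() reports as tot.withinss) falls. The sketch below is in plain Python rather than R, and it is only an illustration under stated assumptions: the points are made up, the helper names are hypothetical, and the starting centers come from a deterministic farthest-first rule so the run is repeatable, whereas R's kmeans() uses random starts.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic seeding (an assumption, not what R does): start at
    points[0], then keep adding the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def tot_withinss(points, k, n_iter=100):
    """Run Lloyd's k-means and return the total within-cluster sum of squares."""
    centers = farthest_first(points, k)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centers[j])  # leave an empty cluster's center alone
        if updated == centers:
            break
        centers = updated
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Made-up 2-D points: two tight groups close together and one far away.
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (4, 1), (4, 2), (5, 1), (5, 2),
    (20, 20), (20, 21), (21, 20), (21, 21),
]

wss = {k: tot_withinss(points, k) for k in range(1, 5)}
for k, value in wss.items():
    print(k, round(value, 2))
```

On this toy data the total keeps dropping as k grows, but the improvement from 3 to 4 clusters is small compared with the earlier drops. That flattening is the "elbow" people look for when plotting these values against k.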
Finding Group Structures in Data using Unsupervised Machine Learning
We will use an algorithm called k-means to look for natural clusters in our data set. Let's take an initial "guess" of 3 clusters to describe our dataset. This gives us a wealth of information. Our clusters have sizes 9, 7, and 8, respectively. We can see that in the clustering vector: the count of 1s is 9, of 2s is 7, and of 3s is 8. Additionally, the mean Income for Cluster 1 is 64K, and its mean Lot Size is 18.5K sq ft.
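To make the pieces of that output concrete (cluster sizes, the clustering vector, the cluster means, and the per-cluster withinss), here is a minimal sketch in plain Python. It is not the post's R code or data: the nine (income, lot size) rows are invented for illustration, the function names are hypothetical, and the starting centers are picked with a deterministic farthest-first rule instead of the random starts R's kmeans() uses. Labels are also 0-based here, while R numbers clusters from 1.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns (labels, centers)."""
    # Deterministic farthest-first seeding (an assumption for repeatability).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centers[j])
        if updated == centers:
            break
        centers = updated
    return labels, centers


# Invented rows: (income in $K, lot size in K sq ft).
households = [
    (40, 14), (44, 15), (42, 16),
    (64, 18), (66, 19), (62, 20),
    (100, 22), (104, 23), (102, 24),
]

labels, centers = kmeans(households, k=3)

sizes = Counter(labels)  # how many rows fall in each cluster
withinss = [
    sum(sq_dist(p, centers[j]) for p, lab in zip(households, labels) if lab == j)
    for j in range(3)
]

print("clustering vector:", labels)
print("cluster sizes:", dict(sizes))
print("cluster means:", centers)
print("withinss per cluster:", withinss)
```

Counting the labels gives the cluster sizes, and each center is simply the column-wise mean of the rows assigned to it, which is how a statement like "the mean Income for Cluster 1" is read off the output. The columns are left unscaled here, so income dominates the distance; standardizing first is a separate design choice.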