The idea behind hierarchical cluster analysis is to show which of a (potentially large) set of samples are most similar to one another, and to group these similar samples in the same limb of a tree. Each sample can be thought of as sitting in an m-dimensional space defined by the m variables (columns) in the dataframe. We define similarity on the basis of the distance between two samples in this m-dimensional space. Several different distance measures could be used, but the default is Euclidean distance, and this is used to work out the distance from every sample to every other sample. This quantitative dissimilarity structure of the data is stored in a matrix produced by the dist function.
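The procedure described above can be sketched in Python using SciPy's pdist and linkage in place of R's dist and hclust; the toy coordinates below are made up for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: 6 samples in an m = 2 dimensional space, forming two obvious groups.
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # group A
    [8.0, 8.0], [8.1, 7.9], [7.8, 8.2],   # group B
])

# Pairwise Euclidean distances: the quantitative dissimilarity structure
# (the same thing R's dist function produces).
d = pdist(X, metric="euclidean")

# Build the tree from the distance structure, then cut it into 2 groups.
Z = linkage(d, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The labels confirm that the three samples in each tight group land in the same limb of the tree.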
Think about the data that you are trying to cluster. How many dimensions are you using? Are the variables highly correlated? Do the variables have different standard deviations? For instance, if your data are log-normally distributed, then most of the cases will sit at the low end of the distribution with a few at the high end.
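To see why differing standard deviations matter, here is a short sketch with two made-up variables, a skewed "income" and a roughly symmetric "age"; without a transformation and scaling step, the high-variance column dominates any Euclidean distance calculation:

```python
import numpy as np

# Hypothetical data: two variables with very different spreads.
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1, size=200)   # log-normal: skewed, huge values
age = rng.normal(loc=40, scale=10, size=200)         # roughly symmetric

X = np.column_stack([income, age])
print(X.std(axis=0))   # income's standard deviation dwarfs age's

# Log-transform the skewed variable, then z-score both columns so each
# contributes comparably to the distance between samples.
X_t = np.column_stack([np.log(income), age])
X_scaled = (X_t - X_t.mean(axis=0)) / X_t.std(axis=0)
print(X_scaled.std(axis=0))   # both columns now have unit standard deviation
```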
K-means clustering is an unsupervised learning algorithm that tries to cluster data based on similarity. Unsupervised learning means that there is no outcome to be predicted; the algorithm simply tries to find patterns in the data. In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. These two steps (reassigning each observation to its nearest centroid and recomputing the centroids) are repeated until the within-cluster variation cannot be reduced any further.
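The two-step loop described above can be sketched as follows; this is a minimal, hypothetical implementation of Lloyd's algorithm for illustration, not a production routine (for real work, a library implementation such as scikit-learn's KMeans would be the usual choice):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: random assignment, then alternate the two steps."""
    rng = np.random.default_rng(seed)
    # Step 0: randomly assign each observation to one of the k clusters.
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iter):
        # Step 1: find the centroid of each cluster
        # (re-seeding from a random point if a cluster has gone empty).
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Step 2: reassign each observation to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the within-cluster variation can no longer be reduced.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

X = np.array([[0.0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids = kmeans(X, k=2)
print(labels)
```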
Abstract-- In this paper, several two-dimensional clustering scenarios are given. In those scenarios, soft partitioning clustering algorithms (Fuzzy C-means (FCM) and Possibilistic C-means (PCM)) are applied. Afterward, VAT is used to investigate the clustering tendency visually, and then, to check cluster validity, three types of indices (PC, DI, and DBI) were used. After observing the clustering algorithms, it was evident that each has its limitations; however, PCM is more robust to noise than FCM, because in FCM a noise point must be assigned as a member of one of the clusters. Clustering [1-3] is a subfield of data mining and is very effective for extracting useful information from a dataset.
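As a rough illustration of the FCM limitation noted above, here is a minimal Fuzzy C-means sketch (the helper name and toy data are made up, not from the paper). Each row of the membership matrix is constrained to sum to 1, which is exactly why a noise point must be shared among the clusters; PCM relaxes this row-sum constraint:

```python
import numpy as np

def fcm(X, k, m=2.0, n_iter=100, seed=0, eps=1e-9):
    """Minimal Fuzzy C-means sketch with fuzzifier m."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)   # memberships in each row sum to 1
    for _ in range(n_iter):
        # Cluster centers: membership-weighted means of the data.
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1)).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
        if np.allclose(U_new, U, atol=1e-6):
            break
        U = U_new
    return U, centers

X = np.array([[0.0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
U, centers = fcm(X, k=2)
labels = U.argmax(axis=1)
print(labels)
```

Because every row of U sums to 1, a point far from all centers still receives full total membership; PCM's possibilistic memberships can instead all be small for such a point.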
Survey analysis often collects data to try to identify response patterns leading to groupings of respondents with different characteristics, as revealed by the answers provided to survey questions. Without additional background information on respondents, it is often very difficult (and many times impossible) to verify the accuracy of the groupings resulting from the analysis. This paper examines one such situation, in which high school students in low-income neighbourhood schools in Bolivia responded to a standard periodic institutional survey and the responses were analysed to better understand the respondents' socioeconomic contexts. In this case study, the question to be answered was "can we identify the most impoverished students based on a standard 22-question survey alone?". With no known dependent variable and no way to objectively capture the socioeconomic condition of the students being surveyed, the task of coming to a conclusive answer becomes unfeasible, as there is no way to validate even a portion of the students identified as most impoverished.