Data Clustering via Principal Direction Gap Partitioning

Ralph Abbey, Jeremy Diepenbrock, Amy Langville, Carl Meyer, Shaina Race, Dexin Zhou

arXiv.org Machine Learning 

Data clustering has applications in fields ranging from the social and biological sciences to business, statistics, information retrieval, machine learning, and data mining. Clustering is the process of grouping data using only information found in the data itself, namely the characteristics of the items and the relationships among them. Although humans are generally very good at discovering patterns and classifying objects, clustering algorithms can discern similarities in data even when humans cannot [6]. The main focus of our research has been document clustering, but we will demonstrate that our methods also work well on scientific data. In this paper, we propose an adaptation of Principal Direction Divisive Partitioning (PDDP), a clustering algorithm developed by Daniel Boley in [2] that is based on Principal Component Analysis (PCA). PCA involves the eigenvector decomposition of a data covariance matrix or, equivalently, a singular value decomposition (SVD) of the data matrix after mean centering. Our adaptation, Principal Direction Gap Partitioning (PDGP), borrows most of its name from PDDP because it follows many of the same steps. The word "gap" replaces the word "divisive" to reflect how the algorithm splits data along natural gaps at each step. This concept is developed further in the following sections, but it should be noted that PDGP is still a divisive algorithm in the same way that PDDP is.
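To make one split step concrete, the following is a minimal sketch rather than the authors' implementation. It mean-centers the data, estimates the leading principal direction, projects each point onto it, and cuts at the widest gap between consecutive projections. Two things here are assumptions and do not come from the paper: the "widest gap" rule is only one plausible reading of "natural gaps" (the actual criterion is developed in later sections), and power iteration on the covariance matrix stands in for the SVD so that the example needs no external libraries. The names `principal_direction` and `gap_split` are hypothetical.

```python
import math


def principal_direction(centered, iters=200):
    """Leading eigenvector of the sample covariance matrix.

    Uses power iteration as a stand-in for the SVD (an assumption made
    to keep the sketch dependency-free).
    """
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Fixed, non-uniform start vector; a degenerate start that is exactly
    # orthogonal to the leading eigenvector is not handled here.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all points identical: no direction to find
            break
        v = [x / norm for x in w]
    return v


def gap_split(points):
    """Split point indices at the widest gap along the principal direction.

    The widest-gap rule is an illustrative assumption, not the paper's
    stated criterion.
    """
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    v = principal_direction(centered)
    proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    order = sorted(range(n), key=lambda i: proj[i])
    gaps = [proj[order[k + 1]] - proj[order[k]] for k in range(n - 1)]
    cut = max(range(n - 1), key=lambda k: gaps[k])
    return order[:cut + 1], order[cut + 1:]


if __name__ == "__main__":
    # Two well-separated groups in the plane.
    pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
    left, right = gap_split(pts)
    print(sorted(left), sorted(right))
```

Because the sign of an eigenvector is arbitrary, which group comes back as `left` and which as `right` carries no meaning; only the partition itself does. A divisive algorithm would apply such a split repeatedly to the resulting clusters.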
