Visualizing the Finer Cluster Structure of Large-Scale and High-Dimensional Data
Yu Liang, Arin Chaudhuri, Haoyu Wang
Dimension reduction and visualization of high-dimensional data have become important research topics in many scientific fields because of the rapid growth of data sets with large sample sizes and/or high dimensionality. In the literature of dimension reduction and information visualization, linear methods such as principal component analysis (PCA) [7] and classical scaling [17] mainly focus on preserving the most significant structure or maximum variance in the data; nonlinear methods such as multidimensional scaling [2], isomap [16], and curvilinear component analysis (CCA) [5] mainly focus on preserving the long or short distances in the high-dimensional space. These methods generally perform well in preserving the global structure of data but can fail to preserve the local structure. In recent years, manifold learning methods, such as SNE [6], Laplacian eigenmap [1], LINE [15], LARGEVIS [14], t-SNE [19] [18], and UMAP [10], have gained popularity because of their ability to preserve both the local structure and some aspects of the global structure of data. These methods generally assume that the data lie on a low-dimensional manifold of the high-dimensional input space, and they seek to find an embedding that preserves the intrinsic structure of the high-dimensional data. However, many manifold learning methods suffer from the "crowding problem" when preserving the local distances of high-dimensional data in low-dimensional space: if the small distances in the high-dimensional space are to be represented faithfully, then points separated by moderate or large distances in the high-dimensional space must be placed too far away from each other in the low-dimensional space.
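To make the contrast concrete, the linear baseline mentioned above (PCA) can be sketched in a few lines. This is a minimal illustration using NumPy and SVD, not the paper's method; the function name and the synthetic data are assumptions for demonstration only.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Illustrative PCA: project rows of X onto the directions of
    maximum variance (the top principal components)."""
    Xc = X - X.mean(axis=0)  # center the data
    # SVD of the centered data; rows of Vt are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # coordinates in the top-k subspace

rng = np.random.default_rng(0)
# 200 synthetic points in 50 dimensions, with most of the variance
# concentrated in the first two coordinates
X = rng.normal(size=(200, 50)) * np.r_[10.0, 5.0, np.ones(48)]
Y = pca_project(X, n_components=2)
print(Y.shape)  # (200, 2)
```

As the abstract notes, such a linear projection captures the maximum-variance (global) structure but makes no attempt to preserve small neighborhood distances, which is what motivates the manifold learning methods discussed next.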
Jul-16-2020