Scalable manifold learning by uniform landmark sampling and constrained locally linear embedding

Dehua Peng, Zhipeng Gui, Wenzhang Wei, Huayi Wu

arXiv.org Artificial Intelligence 

Abstract: As a pivotal approach in machine learning and data science, manifold learning aims to uncover the intrinsic low-dimensional structure hidden within complex nonlinear manifolds in high-dimensional space. By exploiting the manifold hypothesis, various nonlinear dimension reduction techniques have been developed to facilitate visualization, classification, clustering, and the extraction of key insights. Although existing manifold learning methods have achieved remarkable successes, they still suffer from extensive distortions of the global structure, which hinder the understanding of underlying patterns. Scalability issues also limit their applicability to large-scale data. Here, we propose a scalable manifold learning (scML) method that can handle large-scale, high-dimensional data efficiently. It starts by selecting a set of landmarks to construct the low-dimensional skeleton of the entire dataset, and then incorporates the non-landmarks into the learned space via constrained locally linear embedding (CLLE). We empirically validated the effectiveness of scML on synthetic datasets and real-world benchmarks of different types, and applied it to analyzing single-cell transcriptomics and detecting anomalies in electrocardiogram (ECG) signals. The experiments demonstrate notable robustness in embedding quality as the sample rate decreases.

Dimension reduction plays an indispensable role both as preprocessing for machine learning tasks and as visualization for high-dimensional data [1, 2]. It is often applied to address the curse of dimensionality, the phenomenon whereby the amount of data required to achieve a given level of accuracy grows exponentially with the number of dimensions [3]; for instance, covering the unit interval at a spacing of 0.1 takes 10 points, whereas covering the 10-dimensional unit hypercube at the same resolution takes 10^10. This makes it difficult for models to represent features comprehensively and may lead to overfitting [4].
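To make the two-stage pipeline concrete, here is a minimal sketch of its structure, assuming a uniform random landmark sampler and substituting scikit-learn's standard LLE (for the skeleton) and ordinary sum-to-one locally linear reconstruction weights (for placing non-landmarks) in place of the paper's actual CLLE step; the function name scml_sketch and all parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import NearestNeighbors

def scml_sketch(X, sample_rate=0.1, n_neighbors=10, n_components=2, seed=0):
    # Stage 1: uniformly sample landmarks and embed them to form the
    # low-dimensional skeleton (plain LLE stands in for scML's method).
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_landmarks = max(int(sample_rate * n), n_neighbors + 1)
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_landmarks, replace=False)] = True

    Y = np.empty((n, n_components))
    lle = LocallyLinearEmbedding(n_neighbors=n_neighbors,
                                 n_components=n_components)
    Y[mask] = lle.fit_transform(X[mask])

    # Stage 2: place each non-landmark by solving for locally linear
    # reconstruction weights over its k nearest landmarks (a stand-in
    # for the constrained LLE step) and reusing them in the embedding.
    landmarks = np.flatnonzero(mask)
    nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(X[landmarks])
    _, nn = nbrs.kneighbors(X[~mask])
    for i, neigh in zip(np.flatnonzero(~mask), nn):
        Z = X[landmarks[neigh]] - X[i]                # center neighbors on the point
        G = Z @ Z.T                                   # local Gram matrix
        G += 1e-3 * np.trace(G) * np.eye(len(neigh))  # regularize for stability
        w = np.linalg.solve(G, np.ones(len(neigh)))
        w /= w.sum()                                  # sum-to-one constraint
        Y[i] = w @ Y[landmarks[neigh]]                # same weights in low-dim space
    return Y

# Example: embed a Swiss roll using 10% of the points as landmarks.
from sklearn.datasets import make_swiss_roll
X, _ = make_swiss_roll(n_samples=2000, random_state=0)
Y = scml_sketch(X, sample_rate=0.1)
```

Because the eigendecomposition runs only on the landmark subset, the cost of the expensive step scales with the number of landmarks rather than the full dataset, which is the source of the scalability the abstract claims.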