Dimensionality Reduction
Understanding Clinical Decision-Making in Traditional East Asian Medicine through Dimensionality Reduction: An Empirical Investigation
Bae, Hyojin, Kang, Bongsu, Kim, Chang-Eop
This study examines the clinical decision-making processes in Traditional East Asian Medicine (TEAM) by reinterpreting pattern identification (PI) through the lens of dimensionality reduction. Focusing on the Eight Principle Pattern Identification (EPPI) system and utilizing empirical data from the Shang-Han-Lun, we explore the necessity and significance of prioritizing the Exterior-Interior pattern in diagnosis and treatment selection. We test three hypotheses: whether the Ext-Int pattern contains the most information about patient symptoms, represents the most abstract and generalizable symptom information, and facilitates the selection of appropriate herbal prescriptions. Employing quantitative measures such as the abstraction index, cross-conditional generalization performance, and decision tree regression, our results demonstrate that the Exterior-Interior pattern represents the most abstract and generalizable symptom information, contributing to the efficient mapping between symptom and herbal prescription spaces. This research provides an objective framework for understanding the cognitive processes underlying TEAM, bridging traditional medical practices with modern computational approaches. The findings offer insights into the development of AI-driven diagnostic tools in TEAM and conventional medicine, with the potential to advance clinical practice, education, and research.
FedNE: Surrogate-Assisted Federated Neighbor Embedding for Dimensionality Reduction
Li, Ziwei, Wang, Xiaoqi, Chen, Hong-You, Shen, Han-Wei, Chao, Wei-Lun
Federated learning (FL) has rapidly evolved as a promising paradigm that enables collaborative model training across distributed participants without exchanging their local data. Despite its broad applications in fields such as computer vision, graph learning, and natural language processing, the development of a data projection model that can be effectively used to visualize data in the context of FL is crucial yet remains heavily under-explored. Neighbor embedding (NE) is an essential technique for visualizing complex high-dimensional data, but collaboratively learning a joint NE model is difficult. The key challenge lies in the objective function, as effective visualization algorithms like NE require computing loss functions among pairs of data. In this paper, we introduce \textsc{FedNE}, a novel approach that integrates the \textsc{FedAvg} framework with the contrastive NE technique, without any requirements of shareable data. To address the lack of inter-client repulsion which is crucial for the alignment in the global embedding space, we develop a surrogate loss function that each client learns and shares with each other. Additionally, we propose a data-mixing strategy to augment the local data, aiming to relax the problems of invisible neighbors and false neighbors constructed by the local $k$NN graphs. We conduct comprehensive experiments on both synthetic and real-world datasets. The results demonstrate that our \textsc{FedNE} can effectively preserve the neighborhood data structures and enhance the alignment in the global embedding space compared to several baseline methods.
Cost-informed dimensionality reduction for structural digital twin technologies
Hughes, Aidan J., Worden, Keith, Dervilis, Nikolaos, Rogers, Timothy J.
Classification models are a key component of structural digital twin technologies used for supporting asset management decision-making. An important consideration when developing classification models is the dimensionality of the input, or feature space, used. If the dimensionality is too high, then the `curse of dimensionality' may rear its ugly head; manifesting as reduced predictive performance. To mitigate such effects, practitioners can employ dimensionality reduction techniques. The current paper formulates a decision-theoretic approach to dimensionality reduction for structural asset management. In this approach, the aim is to keep incurred misclassification costs to a minimum, as the dimensionality is reduced and discriminatory information may be lost. This formulation is constructed as an eigenvalue problem, with separabilities between classes weighted according to the cost of misclassifying them when considered in the context of a decision process. The approach is demonstrated using a synthetic case study.
Two-Stage Hierarchical and Explainable Feature Selection Framework for Dimensionality Reduction in Sleep Staging
Deng, Yangfan, Albidah, Hamad, Dallal, Ahmed, Yin, Jijun, Mao, Zhi-Hong
Sleep is crucial for human health, and EEG signals play a significant role in sleep research. Due to the high-dimensional nature of EEG signal data sequences, data visualization and clustering of different sleep stages have been challenges. To address these issues, we propose a two-stage hierarchical and explainable feature selection framework by incorporating a feature selection algorithm to improve the performance of dimensionality reduction. Inspired by topological data analysis, which can analyze the structure of high-dimensional data, we extract topological features from the EEG signals to compensate for the structural information loss that happens in traditional spectro-temporal data analysis. Supported by the topological visualization of the data from different sleep stages and the classification results, the proposed features are proven to be effective supplements to traditional features. Finally, we compare the performances of three dimensionality reduction algorithms: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). Among them, t-SNE achieved the highest accuracy of 79.8%, but considering the overall performance in terms of computational resources and metrics, UMAP is the optimal choice.
Investigating Privacy Leakage in Dimensionality Reduction Methods via Reconstruction Attack
Lumbut, Chayadon, Ponnoprat, Donlapark
Machine Learning (ML) models have become essential tools for solving complex real-world problems across various domains, including image processing, natural language processing, and business analytics. However, learning from high-dimensional data can be difficult due to the curse of dimensionality and increased computational requirements. To address these issues, dimensionality reduction methods are employed in order to reduce training costs and improve its efficiency. Popular dimensionality reduction methods include principal component analysis, t-SNE [vdMH08], and UMAP [MHM18]. These methods aim to reduce data dimensions while preserving global and local properties of the original data, ensuring that relationships between data points in higher dimensions are still reflected in lower-dimensional representations. The information retained from the original data is crucial for effective data analysis and visualization.
ADRS-CNet: An adaptive models of dimensionality reduction methods for DNA storage clustering algorithms
In the downstream information retrieval process of DNA storage technology, specific hybridization techniques, such as Polymerase Chain Reaction (PCR) or magnetic bead separation, are commonly used to access data [1]. However, this technology faces several challenges, including high base error rates (insertions, deletions, substitutions, etc.) and the loss of storage sequences, which pose significant threats to the reliability of stored data [2]. To address these issues, clustering and alignment of sequencing data can be employed. A commonly used feature extraction method is based on k-mer frequency matrices, where the dimensionality of the extracted features increases exponentially with the value of k [3] [4] [5]. Therefore, selecting an appropriate dimensionality reduction technique becomes a critical challenge that needs to be addressed. This study aims to develop an adaptive classification model to identify the optimal dimensionality reduction method, thereby mitigating the curse of dimensionality caused by k-mer feature extraction and enhancing the effectiveness of K-means clustering in restoring the original sequence information. Specifically, among the numerous available algorithms, Principal Component Analysis (PCA) [6], t-distributed Stochastic Neighbor Embedding (t-SNE) [7], and Uniform Manifold Approximation and Projection (UMAP) [8] are particularly prominent in the fields of cell biology, bioinformatics, and data visualization [9]. This study addresses the challenge of selecting the appropriate dimensionality reduction method to mitigate the curse of dimensionality in K-means clustering.
Out-of-Core Dimensionality Reduction for Large Data via Out-of-Sample Extensions
Reichmann, Luca, Hägele, David, Weiskopf, Daniel
Dimensionality reduction (DR) is a well-established approach for the visualization of high-dimensional data sets. While DR methods are often applied to typical DR benchmark data sets in the literature, they might suffer from high runtime complexity and memory requirements, making them unsuitable for large data visualization especially in environments outside of high-performance computing. To perform DR on large data sets, we propose the use of out-of-sample extensions. Such extensions allow inserting new data into existing projections, which we leverage to iteratively project data into a reference projection that consists only of a small manageable subset. This process makes it possible to perform DR out-of-core on large data, which would otherwise not be possible due to memory and runtime limitations. For metric multidimensional scaling (MDS), we contribute an implementation with out-of-sample projection capability since typical software libraries do not support it. We provide an evaluation of the projection quality of five common DR algorithms (MDS, PCA, t-SNE, UMAP, and autoencoders) using quality metrics from the literature and analyze the trade-off between the size of the reference set and projection quality. The runtime behavior of the algorithms is also quantified with respect to reference set size, out-of-sample batch size, and dimensionality of the data sets. Furthermore, we compare the out-of-sample approach to other recently introduced DR methods, such as PaCMAP and TriMAP, which claim to handle larger data sets than traditional approaches. To showcase the usefulness of DR on this large scale, we contribute a use case where we analyze ensembles of streamlines amounting to one billion projected instances.
NeurAM: nonlinear dimensionality reduction for uncertainty quantification through neural active manifolds
Zanoni, Andrea, Geraci, Gianluca, Salvador, Matteo, Marsden, Alison L., Schiavazzi, Daniele E.
We present a new approach for nonlinear dimensionality reduction, specifically designed for computationally expensive mathematical models. We leverage autoencoders to discover a one-dimensional neural active manifold (NeurAM) capturing the model output variability, plus a simultaneously learnt surrogate model with inputs on this manifold. The proposed dimensionality reduction framework can then be applied to perform outer loop many-query tasks, like sensitivity analysis and uncertainty propagation. In particular, we prove, both theoretically under idealized conditions, and numerically in challenging test cases, how NeurAM can be used to obtain multifidelity sampling estimators with reduced variance by sampling the models on the discovered low-dimensional and shared manifold among models. Several numerical examples illustrate the main features of the proposed dimensionality reduction strategy, and highlight its advantages with respect to existing approaches in the literature.
GNUMAP: A Parameter-Free Approach to Unsupervised Dimensionality Reduction via Graph Neural Networks
You, Jihee, Jeong, So Won, Donnat, Claire
With the proliferation of Graph Neural Network (GNN) methods stemming from contrastive learning, unsupervised node representation learning for graph data is rapidly gaining traction across various fields, from biology to molecular dynamics, where it is often used as a dimensionality reduction tool. However, there remains a significant gap in understanding the quality of the low-dimensional node representations these methods produce, particularly beyond well-curated academic datasets. To address this gap, we propose here the first comprehensive benchmarking of various unsupervised node embedding techniques tailored for dimensionality reduction, encompassing a range of manifold learning tasks, along with various performance metrics. We emphasize the sensitivity of current methods to hyperparameter choices -- highlighting a fundamental issue as to their applicability in real-world settings where there is no established methodology for rigorous hyperparameter selection. Addressing this issue, we introduce GNUMAP, a robust and parameter-free method for unsupervised node representation learning that merges the traditional UMAP approach with the expressivity of the GNN framework. We show that GNUMAP consistently outperforms existing state-of-the-art GNN embedding methods in a variety of contexts, including synthetic geometric datasets, citation networks, and real-world biomedical data -- making it a simple but reliable dimensionality reduction tool.
A Large-Scale Sensitivity Analysis on Latent Embeddings and Dimensionality Reductions for Text Spatializations
Atzberger, Daniel, Cech, Tim, Scheibel, Willy, Döllner, Jürgen, Behrisch, Michael, Schreck, Tobias
The semantic similarity between documents of a text corpus can be visualized using map-like metaphors based on two-dimensional scatterplot layouts. These layouts result from a dimensionality reduction on the document-term matrix or a representation within a latent embedding, including topic models. Thereby, the resulting layout depends on the input data and hyperparameters of the dimensionality reduction and is therefore affected by changes in them. Furthermore, the resulting layout is affected by changes in the input data and hyperparameters of the dimensionality reduction. However, such changes to the layout require additional cognitive efforts from the user. In this work, we present a sensitivity study that analyzes the stability of these layouts concerning (1) changes in the text corpora, (2) changes in the hyperparameter, and (3) randomness in the initialization. Our approach has two stages: data measurement and data analysis. First, we derived layouts for the combination of three text corpora and six text embeddings and a grid-search-inspired hyperparameter selection of the dimensionality reductions. Afterward, we quantified the similarity of the layouts through ten metrics, concerning local and global structures and class separation. Second, we analyzed the resulting 42817 tabular data points in a descriptive statistical analysis. From this, we derived guidelines for informed decisions on the layout algorithm and highlight specific hyperparameter settings. We provide our implementation as a Git repository at https://github.com/hpicgs/Topic-Models-and-Dimensionality-Reduction-Sensitivity-Study and results as Zenodo archive at https://doi.org/10.5281/zenodo.12772898.