Goto

Collaborating Authors

 Dimensionality Reduction



PCA-Guided Autoencoding for Structured Dimensionality Reduction in Active Infrared Thermography

Salah, Mohammed, Saeed, Numan, Svetinovic, Davor, Sfarra, Stefano, Omar, Mohammed, Abdulrahman, Yusra

arXiv.org Artificial Intelligence

Active Infrared thermography (AIRT) is a widely adopted non-destructive testing (NDT) technique for detecting subsurface anomalies in industrial components. Due to the high dimensionality of AIRT data, current approaches employ non-linear autoencoders (AEs) for dimensionality reduction. However, the latent space learned by AIRT AEs lacks structure, limiting their effectiveness in downstream defect characterization tasks. To address this limitation, this paper proposes a principal component analysis guided (PCA-guided) autoencoding framework for structured dimensionality reduction to capture intricate, non-linear features in thermographic signals while enforcing a structured latent space. A novel loss function, PCA distillation loss, is introduced to guide AIRT AEs to align the latent representation with structured PCA components while capturing the intricate, non-linear patterns in thermographic signals. To evaluate the utility of the learned, structured latent space, we propose a neural network-based evaluation metric that assesses its suitability for defect characterization. Experimental results show that the proposed PCA-guided AE outperforms state-of-the-art dimensionality reduction methods on PVC, CFRP, and PLA samples in terms of contrast, signal-to-noise ratio (SNR), and neural network-based metrics.


Understanding Bias in Perceiving Dimensionality Reduction Projections

Doh, Seoyoung, Jeon, Hyeon, Shin, Sungbok, Quadri, Ghulam Jilani, Kim, Nam Wook, Seo, Jinwook

arXiv.org Artificial Intelligence

We assume a scenario where a practitioner wants to select DR techniques that produce projections suitable for analyzing their dataset. It is crucial to select projections with high faithfulness for reliable analysis, but practitioners tend to be more biased towards the visual interestingness. We verify the existence of such bias and investigate its underlying causes, providing a grounded basis for mitigating the biases. Selecting the dimensionality reduction technique that faithfully represents the structure is essential for reliable visual communication and analytics. In reality, however, practitioners favor projections for other attractions, such as aesthetics and visual saliency, over the projection's structural faithfulness, a bias we define as visual interestingness. In this research, we conduct a user study that (1) verifies the existence of such bias and (2) explains why the bias exists. Our study suggests that visual interestingness biases practitioners' preferences when selecting projections for analysis, and this bias intensifies with color-encoded labels and shorter exposure time. Based on our findings, we discuss strategies to mitigate bias in perceiving and interpreting DR projections.


How does Labeling Error Impact Contrastive Learning? A Perspective from Data Dimensionality Reduction

Chen, Jun, Chen, Hong, Yu, Yonghua, Ying, Yiming

arXiv.org Machine Learning

In recent years, contrastive learning has achieved state-of-the-art performance in the territory of self-supervised representation learning. Many previous works have attempted to provide the theoretical understanding underlying the success of contrastive learning. Almost all of them rely on a default assumption, i.e., the label consistency assumption, which may not hold in practice (the probability of failure is called labeling error) due to the strength and randomness of common augmentation strategies, such as random resized crop (RRC). This paper investigates the theoretical impact of labeling error on the downstream classification performance of contrastive learning. We first reveal several significant negative impacts of labeling error on downstream classification risk. To mitigate these impacts, data dimensionality reduction method (e.g., singular value decomposition, SVD) is applied on original data to reduce false positive samples, and establish both theoretical and empirical evaluations. Moreover, it is also found that SVD acts as a double-edged sword, which may lead to the deterioration of downstream classification accuracy due to the reduced connectivity of the augmentation graph. Based on the above observations, we give the augmentation suggestion that we should use some moderate embedding dimension (such as $512, 1024$ in our experiments), data inflation, weak augmentation, and SVD to ensure large graph connectivity and small labeling error to improve model performance.


Improving Clustering on Occupational Text Data through Dimensionality Reduction

García, Iago Xabier Vázquez, Partanaz, Damla, Yetkin, Emrullah Fatih

arXiv.org Artificial Intelligence

In this study, we focused on proposing an optimal clustering mechanism for the occupations defined in the well-known US-based occupational database, O*NET. Even though all occupations are defined according to well-conducted surveys in the US, their definitions can vary for different firms and countries. Hence, if one wants to expand the data that is already collected in O*NET for the occupations defined with different tasks, a map between the definitions will be a vital requirement. We proposed a pipeline using several BERT-based techniques with various clustering approaches to obtain such a map. We also examined the effect of dimensionality reduction approaches on several metrics used in measuring performance of clustering algorithms. Finally, we improved our results by using a specialized silhouette approach. This new clustering-based mapping approach with dimensionality reduction may help distinguish the occupations automatically, creating new paths for people wanting to change their careers.


Enabling stratified sampling in high dimensions via nonlinear dimensionality reduction

Geraci, Gianluca, Schiavazzi, Daniele E., Zanoni, Andrea

arXiv.org Machine Learning

We consider the problem of propagating the uncertainty from a possibly large number of random inputs through a computationally expensive model. Stratified sampling is a well-known variance reduction strategy, but its application, thus far, has focused on models with a limited number of inputs due to the challenges of creating uniform partitions in high dimensions. To overcome these challenges, we perform stratification with respect to the uniform distribution defined over the unit interval, and then derive the corresponding strata in the original space using nonlinear dimensionality reduction. We show that our approach is effective in high dimensions and can be used to further reduce the variance of multifidelity Monte Carlo estimators.


Randomized Dimensionality Reduction for Euclidean Maximization and Diversity Measures

Gao, Jie, Jayaram, Rajesh, Kolbe, Benedikt, Sapir, Shay, Schwiegelshohn, Chris, Silwal, Sandeep, Waingarten, Erik

arXiv.org Artificial Intelligence

Randomized dimensionality reduction is a widely-used algorithmic technique for speeding up large-scale Euclidean optimization problems. In this paper, we study dimension reduction for a variety of maximization problems, including max-matching, max-spanning tree, max TSP, as well as various measures for dataset diversity. For these problems, we show that the effect of dimension reduction is intimately tied to the \emph{doubling dimension} $λ_X$ of the underlying dataset $X$ -- a quantity measuring intrinsic dimensionality of point sets. Specifically, we prove that a target dimension of $O(λ_X)$ suffices to approximately preserve the value of any near-optimal solution,which we also show is necessary for some of these problems. This is in contrast to classical dimension reduction results, whose dependence increases with the dataset size $|X|$. We also provide empirical results validating the quality of solutions found in the projected space, as well as speedups due to dimensionality reduction.


FedNE: Surrogate-Assisted Federated Neighbor Embedding for Dimensionality Reduction

Neural Information Processing Systems

Federated learning (FL) has rapidly evolved as a promising paradigm that enables collaborative model training across distributed participants without exchanging their local data. Despite its broad applications in fields such as computer vision, graph learning, and natural language processing, the development of a data projection model that can be effectively used to visualize data in the context of FL is crucial yet remains heavily under-explored. Neighbor embedding (NE) is an essential technique for visualizing complex high-dimensional data, but collaboratively learning a joint NE model is difficult. The key challenge lies in the objective function, as effective visualization algorithms like NE require computing loss functions among pairs of data. In this paper, we introduce \textsc{FedNE}, a novel approach that integrates the \textsc{FedAvg} framework with the contrastive NE technique, without any requirements of shareable data.


Multi-Criteria Dimensionality Reduction with Applications to Fairness

Neural Information Processing Systems

Dimensionality reduction is a classical technique widely used for data analysis. One foundational instantiation is Principal Component Analysis (PCA), which minimizes the average reconstruction error. In this paper, we introduce the multi-criteria dimensionality reduction problem where we are given multiple objectives that need to be optimized simultaneously. As an application, our model captures several fairness criteria for dimensionality reduction such as the Fair-PCA problem introduced by Samadi et al. [NeurIPS18] and the Nash Social Welfare (NSW) problem. In the Fair-PCA problem, the input data is divided into k groups, and the goal is to find a single d-dimensional representation for all groups for which the maximum reconstruction error of any one group is minimized.


ESPACE: Dimensionality Reduction of Activations for Model Compression

Neural Information Processing Systems

We propose ESPACE, an LLM compression technique based on dimensionality reduction of activations. Unlike prior works on weight-centric tensor decomposition, ESPACE projects activations onto a pre-calibrated set of principal components. The activation-centrality of the approach enables retraining LLMs with no loss of expressivity; while at inference, weight decomposition is obtained as a byproduct of matrix multiplication associativity. Theoretical results on the construction of projection matrices with optimal computational accuracy are provided. Experimentally, we find ESPACE enables 50% compression of GPT3, Llama2, and Nemotron4 models with small accuracy degradation, as low as a 0.18 perplexity increase on GPT3-22B.