Dimensionality Reduction
Linear Dimensionality Reduction for Word Embeddings in Tabular Data Classification
Ressel, Liam, Gardi, Hamza A. A.
The Engineers' Salary Prediction Challenge requires classifying salary categories into three classes based on tabular data. The job description is represented as a 300-dimensional word embedding incorporated into the tabular features, drastically increasing dimensionality. Additionally, the limited number of training samples makes classification challenging. Linear dimensionality reduction of word embeddings for tabular data classification remains underexplored. This paper studies Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). We show that PCA, with an appropriate subspace dimension, can outperform raw embeddings. LDA without regularization performs poorly due to covariance estimation errors, but applying shrinkage improves performance significantly, even with only two dimensions. We propose Partitioned-LDA, which splits embeddings into equal-sized blocks and performs LDA separately on each, thereby reducing the size of the covariance matrices. Partitioned-LDA outperforms regular LDA and, combined with shrinkage, achieves top-10 accuracy on the competition public leaderboard. This method effectively enhances word embedding performance in tabular data classification with limited training samples.
- Europe > Germany (0.05)
- North America > United States (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.63)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Preserving Vector Space Properties in Dimensionality Reduction: A Relationship Preserving Loss Framework
Weinwurm, Eddi, Kovalenko, Alexander
Dimensionality reduction can distort vector space properties such as orthogonality and linear independence, which are critical for tasks including cross-modal retrieval, clustering, and classification. We propose a Relationship Preserving Loss (RPL), a loss function that preserves these properties by minimizing discrepancies between relationship matrices (e.g., Gram or cosine) of high-dimensional data and their low-dimensional embeddings. RPL trains neural networks for non-linear projections and is supported by error bounds derived from matrix perturbation theory. Initial experiments suggest that RPL reduces embedding dimensions while largely retaining performance on downstream tasks, likely due to its preservation of key vector space properties. While we describe here the use of RPL in dimensionality reduction, this loss can also be applied more broadly, for example to cross-domain alignment and transfer learning, knowledge distillation, fairness and invariance, dehubbing, graph and manifold learning, and federated learning, where distributed embeddings must remain geometrically consistent.
- North America > United States > New York (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- Europe > Czechia > Prague (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.83)
- Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.82)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.44)
- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
- North America > Canada (0.04)
- Europe > Portugal > Coimbra > Coimbra (0.04)
- (4 more...)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.43)
PCA-Guided Autoencoding for Structured Dimensionality Reduction in Active Infrared Thermography
Salah, Mohammed, Saeed, Numan, Svetinovic, Davor, Sfarra, Stefano, Omar, Mohammed, Abdulrahman, Yusra
Active Infrared thermography (AIRT) is a widely adopted non-destructive testing (NDT) technique for detecting subsurface anomalies in industrial components. Due to the high dimensionality of AIRT data, current approaches employ non-linear autoencoders (AEs) for dimensionality reduction. However, the latent space learned by AIRT AEs lacks structure, limiting their effectiveness in downstream defect characterization tasks. To address this limitation, this paper proposes a principal component analysis guided (PCA-guided) autoencoding framework for structured dimensionality reduction to capture intricate, non-linear features in thermographic signals while enforcing a structured latent space. A novel loss function, PCA distillation loss, is introduced to guide AIRT AEs to align the latent representation with structured PCA components while capturing the intricate, non-linear patterns in thermographic signals. To evaluate the utility of the learned, structured latent space, we propose a neural network-based evaluation metric that assesses its suitability for defect characterization. Experimental results show that the proposed PCA-guided AE outperforms state-of-the-art dimensionality reduction methods on PVC, CFRP, and PLA samples in terms of contrast, signal-to-noise ratio (SNR), and neural network-based metrics.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.15)
- Europe > Austria > Vienna (0.14)
- Europe > Italy > Abruzzo > L'Aquila Province > L'Aquila (0.04)
- Energy (0.68)
- Materials > Construction Materials (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Understanding Bias in Perceiving Dimensionality Reduction Projections
Doh, Seoyoung, Jeon, Hyeon, Shin, Sungbok, Quadri, Ghulam Jilani, Kim, Nam Wook, Seo, Jinwook
We assume a scenario where a practitioner wants to select DR techniques that produce projections suitable for analyzing their dataset. It is crucial to select projections with high faithfulness for reliable analysis, but practitioners tend to be more biased towards the visual interestingness. We verify the existence of such bias and investigate its underlying causes, providing a grounded basis for mitigating the biases. Selecting the dimensionality reduction technique that faithfully represents the structure is essential for reliable visual communication and analytics. In reality, however, practitioners favor projections for other attractions, such as aesthetics and visual saliency, over the projection's structural faithfulness, a bias we define as visual interestingness. In this research, we conduct a user study that (1) verifies the existence of such bias and (2) explains why the bias exists. Our study suggests that visual interestingness biases practitioners' preferences when selecting projections for analysis, and this bias intensifies with color-encoded labels and shorter exposure time. Based on our findings, we discuss strategies to mitigate bias in perceiving and interpreting DR projections.
- North America > United States > Oklahoma (0.04)
- Europe > Portugal > Coimbra > Coimbra (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.62)
How does Labeling Error Impact Contrastive Learning? A Perspective from Data Dimensionality Reduction
Chen, Jun, Chen, Hong, Yu, Yonghua, Ying, Yiming
In recent years, contrastive learning has achieved state-of-the-art performance in the territory of self-supervised representation learning. Many previous works have attempted to provide the theoretical understanding underlying the success of contrastive learning. Almost all of them rely on a default assumption, i.e., the label consistency assumption, which may not hold in practice (the probability of failure is called labeling error) due to the strength and randomness of common augmentation strategies, such as random resized crop (RRC). This paper investigates the theoretical impact of labeling error on the downstream classification performance of contrastive learning. We first reveal several significant negative impacts of labeling error on downstream classification risk. To mitigate these impacts, data dimensionality reduction method (e.g., singular value decomposition, SVD) is applied on original data to reduce false positive samples, and establish both theoretical and empirical evaluations. Moreover, it is also found that SVD acts as a double-edged sword, which may lead to the deterioration of downstream classification accuracy due to the reduced connectivity of the augmentation graph. Based on the above observations, we give the augmentation suggestion that we should use some moderate embedding dimension (such as $512, 1024$ in our experiments), data inflation, weak augmentation, and SVD to ensure large graph connectivity and small labeling error to improve model performance.
- Asia > China > Hubei Province > Wuhan (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > Canada (0.04)
- Africa > Guinea > Kankan Region > Kankan Prefecture > Kankan (0.04)
- Research Report > New Finding (0.34)
- Research Report > Experimental Study (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.62)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.56)
Improving Clustering on Occupational Text Data through Dimensionality Reduction
García, Iago Xabier Vázquez, Partanaz, Damla, Yetkin, Emrullah Fatih
In this study, we focused on proposing an optimal clustering mechanism for the occupations defined in the well-known US-based occupational database, O*NET. Even though all occupations are defined according to well-conducted surveys in the US, their definitions can vary for different firms and countries. Hence, if one wants to expand the data that is already collected in O*NET for the occupations defined with different tasks, a map between the definitions will be a vital requirement. We proposed a pipeline using several BERT-based techniques with various clustering approaches to obtain such a map. We also examined the effect of dimensionality reduction approaches on several metrics used in measuring performance of clustering algorithms. Finally, we improved our results by using a specialized silhouette approach. This new clustering-based mapping approach with dimensionality reduction may help distinguish the occupations automatically, creating new paths for people wanting to change their careers.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- (5 more...)
- Education (0.68)
- Information Technology (0.46)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.84)
Enabling stratified sampling in high dimensions via nonlinear dimensionality reduction
Geraci, Gianluca, Schiavazzi, Daniele E., Zanoni, Andrea
We consider the problem of propagating the uncertainty from a possibly large number of random inputs through a computationally expensive model. Stratified sampling is a well-known variance reduction strategy, but its application, thus far, has focused on models with a limited number of inputs due to the challenges of creating uniform partitions in high dimensions. To overcome these challenges, we perform stratification with respect to the uniform distribution defined over the unit interval, and then derive the corresponding strata in the original space using nonlinear dimensionality reduction. We show that our approach is effective in high dimensions and can be used to further reduce the variance of multifidelity Monte Carlo estimators.
- North America > United States > New York (0.04)
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- (4 more...)
- Government > Regional Government > North America Government > United States Government (0.68)
- Energy (0.67)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Dimensionality Reduction (0.61)