Dimensionality Reduction
Practical Hash Functions for Similarity Estimation and Dimensionality Reduction
Søren Dahlgaard, Mathias Knudsen, Mikkel Thorup
Hashing is a basic tool for dimensionality reduction employed in several aspects of machine learning. However, the performance analysis is often carried out under the abstract assumption that a truly random unit-cost hash function is used, without concern for which concrete hash function is employed. The concrete hash function may work fine on sufficiently random input. The question is whether it can be trusted in the real world, where it may be faced with more structured input. In this paper we focus on two prominent applications of hashing, namely similarity estimation with the one permutation hashing (OPH) scheme of Li et al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of which have found numerous applications, e.g. in approximate near-neighbour search with LSH and large-scale classification with SVM.
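To make the feature-hashing side concrete, here is a minimal Python sketch of the hashing trick. Python's built-in `hash` stands in for the concrete hash function whose trustworthiness the paper analyzes, and the function name `feature_hash` is illustrative rather than taken from the paper.

```python
import numpy as np


def feature_hash(features, dim, seed=0):
    """Project a sparse feature dict into `dim` buckets via the hashing trick.

    Each feature name is hashed to a bucket, and a second hash chooses a +/-1
    sign so that collisions cancel in expectation. Python's built-in `hash`
    is only a stand-in for a concrete hash function.
    """
    x = np.zeros(dim)
    for name, value in features.items():
        bucket = hash((seed, name)) % dim
        sign = 1.0 if hash((seed + 1, name)) % 2 == 0 else -1.0
        x[bucket] += sign * value
    return x


# Two documents as bags of words; their inner product is approximately
# preserved after hashing into a much smaller dimension.
doc_a = {"the": 2.0, "quick": 1.0, "fox": 1.0}
doc_b = {"the": 1.0, "lazy": 1.0, "fox": 1.0}
print(feature_hash(doc_a, dim=16) @ feature_hash(doc_b, dim=16))
```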
Model-based targeted dimensionality reduction for neuronal population data
Summarizing high-dimensional data using a small number of parameters is a ubiquitous first step in the analysis of neuronal population activity. Recently developed methods use "targeted" approaches that work by identifying multiple, distinct low-dimensional subspaces of activity that capture the population response to individual experimental task variables, such as the value of a presented stimulus or the behavior of the animal. These methods have gained attention because they decompose total neural activity into what are ostensibly different parts of a neuronal computation. However, existing targeted methods have been developed outside of the confines of probabilistic modeling, making some aspects of the procedures ad hoc, or limited in flexibility or interpretability. Here we propose a new model-based method for targeted dimensionality reduction based on a probabilistic generative model of the population response data.
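As a rough illustration of what a "targeted" decomposition does, the sketch below fits one neural axis per task variable by ordinary least-squares regression on synthetic data. This is a simplified regression-style baseline, not the probabilistic generative model proposed in the paper; the variable names and toy dimensions are illustrative.

```python
import numpy as np

# Hypothetical toy setup: activity of N neurons on T trials, each trial labeled
# by two task variables (e.g. stimulus value and choice).
rng = np.random.default_rng(0)
N, T = 50, 200
stimulus = rng.normal(size=T)
choice = rng.choice([-1.0, 1.0], size=T)
X = np.column_stack([stimulus, choice, np.ones(T)])   # task variables + intercept
true_axes = rng.normal(size=(2, N))
activity = X[:, :2] @ true_axes + 0.5 * rng.normal(size=(T, N))

# Regress each neuron's activity on the task variables; each coefficient row
# (excluding the intercept) is a direction in neural space associated with one
# task variable -- a one-dimensional "targeted" subspace per variable.
beta, *_ = np.linalg.lstsq(X, activity, rcond=None)
stim_axis = beta[0] / np.linalg.norm(beta[0])
choice_axis = beta[1] / np.linalg.norm(beta[1])

# Project population activity onto the per-variable axes to read out the
# low-dimensional components attributed to each variable.
stim_component = activity @ stim_axis
choice_component = activity @ choice_axis
print(np.corrcoef(stim_component, stimulus)[0, 1])
```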
Dimensionality Reduction has Quantifiable Imperfections: Two Geometric Bounds
In this paper, we investigate dimensionality reduction (DR) maps in an information retrieval setting from a quantitative topology point of view. In particular, we show that no DR map can achieve perfect precision and perfect recall simultaneously. Thus a continuous DR map must have imperfect precision. We further prove an upper bound on the precision of Lipschitz continuous DR maps. While precision is a natural measure in an information retrieval setting, it does not measure "how" wrong the retrieved data is. We therefore propose a new measure based on the Wasserstein distance that comes with similar theoretical guarantees.
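The precision notion here is the usual neighbourhood-retrieval one: retrieve the k nearest neighbours of a point in the reduced space and ask how many are true neighbours in the original space. The sketch below estimates this empirically for a random linear projection, which is only a stand-in DR map for illustration; the data, dimensions, and helper name `knn_indices` are assumptions made for the example.

```python
import numpy as np


def knn_indices(points, query_idx, k):
    """Indices of the k nearest neighbours of points[query_idx], excluding itself."""
    d = np.linalg.norm(points - points[query_idx], axis=1)
    d[query_idx] = np.inf
    return set(np.argsort(d)[:k])


# Hypothetical experiment: high-dimensional Gaussian data mapped to 2-D by a
# random linear projection, then per-point neighbourhood precision measured.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 50))          # original data
P = rng.normal(size=(50, 2)) / np.sqrt(2)
Y = X @ P                               # reduced data
k = 10
precisions = []
for i in range(len(X)):
    true_nn = knn_indices(X, i, k)
    retrieved = knn_indices(Y, i, k)
    precisions.append(len(true_nn & retrieved) / k)   # with equal k, precision == recall
print("mean neighbourhood precision:", np.mean(precisions))
```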
A.1 Effect of UNet layers
A.2 Effect of dimensionality reduction
A.3 Effect of fusion strategy
A.4 Effect of captioner and timestep
To further understand the contributions of each component in our method as well as the impact of various design choices, we conduct a series of ablation studies on the SPair-71k dataset [7]. The quantitative results are reported in terms of PCK at different κ thresholds, and we sample 20 pairs for each category. We report PCK@κ (κ = 0.01, 0.05, 0.10) for each setting and both the Stable Diffusion and Fuse-ViT-B/14 methods. We analyze how features extracted at different layers in the U-Net architecture affect the accuracy, specifically at layers 2, 5, and 8, for the Stable Diffusion (SD) and Fuse-ViT-B/14 methods. The experimental results in Tab. 1 suggest that layer 5 alone provides substantial performance for both the Stable Diffusion and the fused features, while gathering all three layers further improves the overall performance for the fused features.
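For reference, PCK@κ counts a predicted keypoint as correct when it lies within κ times a normalizing size of the ground-truth location. The sketch below assumes normalization by the maximum bounding-box side, a common convention on SPair-71k; the helper name `pck` and the toy numbers are illustrative, not taken from the paper.

```python
import numpy as np


def pck(pred_kps, gt_kps, bbox_size, kappa):
    """Fraction of predicted keypoints within kappa * bbox_size of the ground truth."""
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float(np.mean(dists <= kappa * bbox_size))


# Toy example: 4 keypoints, bounding box of size 240, thresholds as in the ablation.
pred = np.array([[10.0, 12.0], [55.0, 60.0], [100.0, 98.0], [200.0, 210.0]])
gt = np.array([[11.0, 12.0], [50.0, 66.0], [120.0, 90.0], [201.0, 212.0]])
for kappa in (0.01, 0.05, 0.10):
    print(f"PCK@{kappa:.2f} = {pck(pred, gt, bbox_size=240, kappa=kappa):.2f}")
```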
Dimensionality Reduction for Wasserstein Barycenter
The Wasserstein barycenter is a geometric construct which captures the notion of centrality among probability distributions, and which has found many applications in machine learning. However, most algorithms for finding even an approximate barycenter suffer an exponential dependence on the dimension d of the underlying space of the distributions. In order to cope with this "curse of dimensionality," we study dimensionality reduction techniques for the Wasserstein barycenter problem. When the barycenter is restricted to support of size n, we show that randomized dimensionality reduction can be used to map the problem to a space of dimension O(log n) independent of both d and the number k of input distributions, and that any solution found in the reduced dimension will have its cost preserved up to arbitrarily small error in the original space. We provide matching upper and lower bounds on the size of the reduced dimension, showing that our methods are optimal up to constant factors. We also provide a coreset construction for the Wasserstein barycenter problem that significantly decreases the number of input distributions. The coresets can be used in conjunction with random projections and thus further improve computation time. Lastly, our experimental results validate the speedup provided by dimensionality reduction while maintaining solution quality.
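The core reduction step is a Johnson-Lindenstrauss-style random projection of all support points to roughly O(log n) dimensions. The sketch below shows only that projection and the resulting distance distortion; the barycenter solver itself is omitted, and the dimensions, constants, and synthetic data are assumptions chosen purely for illustration.

```python
import numpy as np

# Minimal sketch: support points of each input distribution are mapped by a
# Gaussian random matrix to a dimension proportional to log n, where n is the
# barycenter support size. A barycenter solver (omitted) would then run
# entirely in the reduced space.
rng = np.random.default_rng(0)
d, n, k = 1000, 64, 5                     # ambient dim, support size, number of distributions
target_dim = int(np.ceil(8 * np.log(n)))  # O(log n); the constant 8 is arbitrary here
G = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)

distributions = [rng.normal(size=(n, d)) for _ in range(k)]
reduced = [X @ G for X in distributions]

# Pairwise squared distances between support points (the costs entering the
# Wasserstein objective) are preserved up to small relative error.
a, b = distributions[0][0], distributions[1][0]
orig = np.sum((a - b) ** 2)
red = np.sum((a @ G - b @ G) ** 2)
print("relative distortion:", abs(red - orig) / orig)
```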