householder transformation
Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation
A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of down-projection and up-projection matrices, with the bottleneck dimensionality being crucial for reducing the number of learnable parameters, as exemplified by prevalent methods like LoRA and Adapter. However, these low-rank strategies typically employ a fixed bottleneck dimensionality, which limits their flexibility in handling layer-wise variations. To address this limitation, we propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix. SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix.
PaTH Attention: Position Encoding via Accumulating Householder Transformations
Yang, Songlin, Shen, Yikang, Wen, Kaiyue, Tan, Shawn, Mishra, Mayank, Ren, Liliang, Panda, Rameswar, Kim, Yoon
The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH demonstrates superior performance compared to RoPE and other recent baselines.
DeltaProduct: Increasing the Expressivity of DeltaNet Through Products of Householders
Siems, Julien, Carstensen, Timur, Zela, Arber, Hutter, Frank, Pontil, Massimiliano, Grazzi, Riccardo
Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However, existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices. While diagonal matrices used in architectures like Mamba, GLA, or mLSTM yield fast runtime, they suffer from severely limited expressivity. To address this, recent architectures such as (Gated) DeltaNet and RWKVv7 adopted a diagonal plus rank-1 structure, allowing simultaneous token-channel mixing, which overcomes some expressivity limitations with only a slight decrease in training efficiency. Building on the interpretation of DeltaNet's recurrence as performing one step of online gradient descent per token on an associative recall loss, we introduce DeltaProduct, which instead takes multiple ($n_h$) steps per token. This naturally leads to diagonal plus rank-$n_h$ state-transition matrices, formed as products of $n_h$ generalized Householder transformations, providing a tunable mechanism to balance expressivity and efficiency and a stable recurrence. Through extensive experiments, we demonstrate that DeltaProduct achieves superior state-tracking and language modeling capabilities while exhibiting significantly improved length extrapolation compared to DeltaNet. Additionally, we also strengthen the theoretical foundation of DeltaNet's expressivity by proving that it can solve dihedral group word problems in just two layers.
Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation
Dong, Wei, Sun, Yuan, Yang, Yiting, Zhang, Xing, Lin, Zhijun, Yan, Qingsen, Zhang, Haokui, Wang, Peng, Yang, Yang, Shen, Hengtao
A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of down-projection and up-projection matrices, with the bottleneck dimensionality being crucial for reducing the number of learnable parameters, as exemplified by prevalent methods like LoRA and Adapter. However, these low-rank strategies typically employ a fixed bottleneck dimensionality, which limits their flexibility in handling layer-wise variations. To address this limitation, we propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix. SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix. We utilize Householder transformations to construct orthogonal matrices that efficiently mimic the unitary matrices, requiring only a vector. The diagonal values are learned in a layer-wise manner, allowing them to flexibly capture the unique properties of each layer. This approach enables the generation of adaptation matrices with varying ranks across different layers, providing greater flexibility in adapting pre-trained models. Experiments on standard downstream vision tasks demonstrate that our method achieves promising fine-tuning performance.
Rotation Invariant Householder Parameterization for Bayesian PCA
Nirwan, Rajbir S., Bertschinger, Nils
We consider probabilistic PCA and related factor models from a Bayesian perspective. These models are in general not identifiable as the likelihood has a rotational symmetry. This gives rise to complicated posterior distributions with continuous subspaces of equal density and thus hinders efficiency of inference as well as interpretation of obtained parameters. In particular, posterior averages over factor loadings become meaningless and only model predictions are unambiguous. Here, we propose a parameterization based on Householder transformations, which remove the rotational symmetry of the posterior. Furthermore, by relying on results from random matrix theory, we establish the parameter distribution which leaves the model unchanged compared to the original rotationally symmetric formulation. In particular, we avoid the need to compute the Jacobian determinant of the parameter transformation. This allows us to efficiently implement probabilistic PCA in a rotation invariant fashion in any state of the art toolbox. Here, we implemented our model in the probabilistic programming language Stan and illustrate it on several examples.
Learning Structural Weight Uncertainty for Sequential Decision-Making
Zhang, Ruiyi, Li, Chunyuan, Chen, Changyou, Carin, Lawrence
Learning probability distributions on the weights of neural networks (NNs) has recently proven beneficial in many applications. Bayesian methods, such as Stein variational gradient descent (SVGD), offer an elegant framework to reason about NN model uncertainty. However, by assuming independent Gaussian priors for the individual NN weights (as often applied), SVGD does not impose prior knowledge that there is often structural information (dependence) among weights. We propose efficient posterior learning of structural weight uncertainty, within an SVGD framework, by employing matrix variate Gaussian priors on NN parameters. We further investigate the learned structural uncertainty in sequential decision-making problems, including contextual bandits and reinforcement learning. Experiments on several synthetic and real datasets indicate the superiority of our model, compared with state-of-the-art methods.
Improving Variational Auto-Encoders using Householder Flow
Tomczak, Jakub M., Welling, Max
Variational auto-encoders (VAE) are scalable and powerful generative models. However, the choice of the variational posterior determines tractability and flexibility of the VAE. Commonly, latent variables are modeled using the normal distribution with a diagonal covariance matrix. This results in computational efficiency but typically it is not flexible enough to match the true posterior distribution. One fashion of enriching the variational posterior distribution is application of normalizing flows, i.e., a series of invertible transformations to latent variables with a simple posterior. In this paper, we follow this line of thinking and propose a volume-preserving flow that uses a series of Householder transformations. We show empirically on MNIST dataset and histopathology data that the proposed flow allows to obtain more flexible variational posterior and competitive results comparing to other normalizing flows.