Orthogonal Self-Attention

Leo Zhang, James Martens

arXiv.org Machine Learning 

Skip connections [He et al., 2016] have become a ubiquitous feature of neural network architectures because they facilitate the stable training of deep models. However, despite their success, prior works [Veit et al., 2016, Gromov et al., 2024, Zhang et al., 2024] have raised the concern that the benefits of skip connections, namely ease of training, may be obscuring deeper issues they induce in representation learning. The main point behind these criticisms is that skip connections appear to bias models away from properly utilising the full depth of their architectures. For instance, Ji et al. [2025a] argue that since skip connections continually reintroduce earlier features into deeper layers, they disrupt the learning of hierarchical and progressively more abstract representations, fundamentally harming representation learning. Motivated by this line of reasoning, we explore designing Transformers that can be trained stably without skip connections. Previous works [He et al., 2023, Ji et al., 2025a] have tackled this through modifications to Softmax Self-Attention (SSA) [Vaswani et al., 2017] and to weight initialisations that improve signal propagation and the conditioning of the Jacobian matrix. However, these works restrict themselves to standard Softmax-based Transformers, which appear to be inherently unstable without skip connections [Dong et al., 2021, Ji et al., 2025b] due to SSA.
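To make concrete which pathway is under discussion, the following is a minimal sketch (not the paper's method) of a Transformer block with an optional skip connection. The pre-norm placement, GELU feed-forward network, and use of torch.nn.MultiheadAttention are illustrative assumptions rather than details from the original; the point is only that the residual path x + f(x) reintroduces earlier features, and that setting use_skip=False removes it.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """A single Transformer block with an optional skip (residual) connection."""

    def __init__(self, d_model: int, n_heads: int, use_skip: bool = True):
        super().__init__()
        self.use_skip = use_skip
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        # With the skip, earlier features are carried forward unchanged;
        # without it, the block output depends only on the attention map.
        x = x + a if self.use_skip else a
        m = self.mlp(self.norm2(x))
        return x + m if self.use_skip else m


x = torch.randn(2, 16, 64)                      # (batch, sequence, features)
print(Block(64, 8, use_skip=False)(x).shape)    # torch.Size([2, 16, 64])
```

With use_skip=True this is the standard residual block; with use_skip=False one obtains the skip-free setting that standard Softmax-based Transformers struggle to train stably in.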