Initialisation scheme


Orthogonal Self-Attention

Zhang, Leo, Martens, James

arXiv.org Machine Learning

Skip connections [He et al., 2016] have become a ubiquitous feature of neural network architectures because they facilitate the stable training of deep models. Despite this success, prior works [Veit et al., 2016, Gromov et al., 2024, Zhang et al., 2024] have raised the concern that the ease of training skip connections provide may be masking deeper representation-learning issues that they induce. The central point of these criticisms is that skip connections appear to bias models away from properly utilising the full depth of their architectures. For instance, Ji et al. [2025a] argue that because skip connections continually reintroduce earlier features into deeper layers, they disrupt the learning of hierarchical, progressively more abstract representations, fundamentally harming representation learning. Motivated by this line of reasoning, we explore designing Transformers that can be trained stably without skip connections. Previous works [He et al., 2023, Ji et al., 2025a] have tackled this through modifications to Softmax Self-Attention (SSA) [Vaswani et al., 2017] and through weight initialisations that improve signal propagation and the conditioning of the Jacobian matrix. However, these works restrict themselves to standard Softmax-based Transformers, which, due to SSA, appear to be inherently unstable without skip connections [Dong et al., 2021, Ji et al., 2025b].
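To make the setting concrete, here is a minimal NumPy sketch of a single-head softmax self-attention block in which the residual (skip) path can be toggled off. All names and shapes are illustrative, and this is the standard SSA baseline, not the paper's proposed orthogonal variant.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(x, Wq, Wk, Wv, use_skip=True):
    """Single-head softmax self-attention; `use_skip` toggles the residual path.

    x: (T, d) sequence of token features; Wq, Wk, Wv: (d, d) projections.
    Layer norm and the MLP sub-block are omitted for brevity.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (T, T) attention map
    out = weights @ v
    # use_skip=False gives the skip-free regime discussed above, in which
    # standard SSA Transformers appear inherently unstable to train.
    return x + out if use_skip else out
```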


Approximate Gaussianity Beyond Initialisation in Neural Networks

Hirst, Edward, Ramgoolam, Sanjaye

arXiv.org Artificial Intelligence

Ensembles of neural network weight matrices are studied through the training process for the MNIST classification problem, testing the efficacy of matrix models for representing their distributions under assumptions of Gaussianity and permutation symmetry. The general 13-parameter permutation-invariant Gaussian matrix models are found to be effective models for the correlated Gaussianity in the weight matrices, beyond the range of applicability of the simple Gaussian with independent, identically distributed matrix variables, and notably well beyond the initialisation step. The representation-theoretic model parameters and the graph-theoretic characterisation of the permutation-invariant matrix observables give an interpretable framework for the best-fit model and for small departures from Gaussianity. Additionally, the Wasserstein distance is calculated for this class of models and used to quantify the movement of the distributions over training. Throughout the work, the effects of varied initialisation regimes, regularisation, layer depth, and layer width are tested for this formalism, identifying limits where particular departures from Gaussianity are enhanced and how more general, yet still highly interpretable, models can be developed.
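As a toy illustration of quantifying distributional drift over training, the sketch below fits only the simplest iid Gaussian model to a weight matrix's entries (not the paper's 13-parameter permutation-invariant models) and uses the closed-form 2-Wasserstein distance between one-dimensional Gaussians; the checkpoint data here is synthetic.

```python
import numpy as np

def gaussian_fit(W):
    """Fit the simplest iid Gaussian model to a weight matrix's entries."""
    w = W.ravel()
    return w.mean(), w.std()

def w2_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form 2-Wasserstein distance between 1-D Gaussians:
    W2^2 = (mu1 - mu2)^2 + (sigma1 - sigma2)^2."""
    return np.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)

# Quantify how far the entry distribution of one layer has moved
# between two (here synthetic) training checkpoints.
rng = np.random.default_rng(0)
W_init = rng.standard_normal((100, 100))
W_trained = 1.5 * rng.standard_normal((100, 100))
print(w2_gaussian(*gaussian_fit(W_init), *gaussian_fit(W_trained)))
```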



Eigenvalue initialisation and regularisation for Koopman autoencoders

Miller, Jack W., O'Neill, Charles, Constantinou, Navid C., Azencot, Omri

arXiv.org Artificial Intelligence

Regularising the parameter matrices of neural networks is ubiquitous in training deep models. Typical approaches suggest initialising weights with small random values and penalising them during training to promote sparsity. However, these widely used techniques may be less effective in certain scenarios. Here, we study the Koopman autoencoder model, which includes an encoder, a Koopman operator layer, and a decoder. These models are designed to tackle physics-related problems, offering interpretable dynamics and an ability to incorporate physics-related constraints; however, the majority of existing work employs standard regularisation practices. In our work, we take a step toward augmenting Koopman autoencoders with initialisation and penalty schemes tailored for physics-related settings. Specifically, we propose the "eigeninit" initialisation scheme, which samples initial Koopman operators from specific eigenvalue distributions. In addition, we suggest the "eigenloss" penalty scheme, which penalises the eigenvalues of the Koopman operator during training. We demonstrate the utility of these schemes on two synthetic datasets, a driven pendulum and flow past a cylinder, and two real-world problems, ocean surface temperatures and cyclone wind fields. We find on these datasets that eigenloss and eigeninit improve the convergence rate by up to a factor of 5 and reduce the cumulative long-term prediction error by up to a factor of 3. This finding points to the utility of incorporating similar schemes as an inductive bias in other physics-related deep learning approaches.
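A minimal NumPy sketch of what such schemes could look like, under stated assumptions: the `eigeninit` below samples a real operator whose eigenvalue moduli are drawn uniformly from an interval near the unit circle (the paper's actual eigenvalue distributions may differ), and the `eigenloss` is one plausible hinge-style penalty on eigenvalue moduli; in training, the penalty would be computed with an autodiff framework rather than NumPy.

```python
import numpy as np

def eigeninit(n, r_min=0.9, r_max=1.0, seed=0):
    """Sample a real n x n operator whose eigenvalue moduli are uniform
    on [r_min, r_max]: build 2x2 rotation-scaling blocks (plus one 1x1
    block if n is odd), then conjugate by a random orthogonal matrix,
    which preserves the spectrum."""
    rng = np.random.default_rng(seed)
    blocks = []
    if n % 2:
        blocks.append(np.array([[rng.uniform(r_min, r_max)]]))
    for _ in range(n // 2):
        r, th = rng.uniform(r_min, r_max), rng.uniform(0.0, np.pi)
        blocks.append(r * np.array([[np.cos(th), -np.sin(th)],
                                    [np.sin(th),  np.cos(th)]]))
    K = np.zeros((n, n))
    i = 0
    for b in blocks:
        K[i:i + len(b), i:i + len(b)] = b
        i += len(b)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal basis
    return Q @ K @ Q.T

def eigenloss(K, target=1.0):
    """One plausible eigenvalue penalty: hinge on moduli above `target`,
    discouraging unstable (explosive) linear dynamics."""
    mods = np.abs(np.linalg.eigvals(K))
    return np.sum(np.maximum(mods - target, 0.0) ** 2)
```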


Exploring Low Rank Training of Deep Neural Networks

Kamalakara, Siddhartha Rao, Locatelli, Acyr, Venkitesh, Bharat, Ba, Jimmy, Gal, Yarin, Gomez, Aidan N.

arXiv.org Artificial Intelligence

Training deep neural networks in low rank, i.e. with factorised layers, is of particular interest to the community: it offers efficiency over unfactorised training in terms of both memory consumption and training time. Prior work has focused on low-rank approximations of pre-trained networks and on training in low-rank space with additional objectives, offering various ad hoc explanations for the chosen practices. We analyse techniques that work well in practice, and through extensive ablations on models such as GPT-2 we provide evidence falsifying common beliefs in the field, hinting in the process at exciting research questions that remain open.
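For concreteness, here is a minimal sketch of a factorised linear layer: the dense weight W is parameterised as U @ V of rank r, so the layer stores and trains r(d_in + d_out) parameters instead of d_in * d_out. The initialisation scaling shown is illustrative, not necessarily one of the schemes the paper ablates.

```python
import numpy as np

class LowRankLinear:
    """Factorised layer: y = x @ U @ V, with U: (d_in, r) and V: (r, d_out).

    Stores r * (d_in + d_out) parameters instead of d_in * d_out, and the
    full W = U @ V is never materialised.
    """
    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.standard_normal((d_in, rank)) / np.sqrt(d_in)
        self.V = rng.standard_normal((rank, d_out)) / np.sqrt(rank)

    def __call__(self, x):
        return (x @ self.U) @ self.V  # two skinny matmuls instead of one dense one

# rank 64 in place of a dense 768 x 768 weight: ~6x fewer parameters
layer = LowRankLinear(d_in=768, d_out=768, rank=64)
y = layer(np.random.default_rng(1).standard_normal((16, 768)))  # (16, 768)
```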


Speech Modelling Using Subspace and EM Techniques

Smith, Gavin, Freitas, João F. G. de, Robinson, Tony, Niranjan, Mahesan

Neural Information Processing Systems

The speech waveform can be modelled as a piecewise-stationary linear stochastic state-space system, and its parameters can be estimated using an expectation-maximisation (EM) algorithm. One problem is the initialisation of the EM algorithm. Standard initialisation schemes can lead to poor formant trajectories, yet these trajectories are important for vowel intelligibility. The aim of this paper is to investigate the suitability of subspace identification methods for initialising EM. The paper compares the subspace state space system identification (4SID) method with the EM algorithm. The 4SID and EM methods are similar in that they both estimate a state sequence (using Kalman filters and Kalman smoothers, respectively) and then estimate parameters (using least-squares and maximum likelihood, respectively).
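A minimal NumPy sketch of the generative model in question, assuming a single time-invariant linear-Gaussian state-space system (the piecewise-stationary case applies this per segment); all matrices are illustrative and would, in the paper's setting, be estimated from the observations by 4SID or EM.

```python
import numpy as np

def simulate_lgssm(A, C, Q, R, x0, T, seed=0):
    """Simulate the linear stochastic state-space model:
        x[t+1] = A x[t] + w[t],  w ~ N(0, Q)   (hidden state dynamics)
        y[t]   = C x[t] + v[t],  v ~ N(0, R)   (observed waveform frames)
    4SID (least-squares on a filter-based state estimate) or EM (Kalman
    smoothing plus maximum likelihood) would recover A, C, Q, R from y."""
    rng = np.random.default_rng(seed)
    n, m = A.shape[0], C.shape[0]
    x, ys = np.asarray(x0, dtype=float), []
    for _ in range(T):
        ys.append(C @ x + rng.multivariate_normal(np.zeros(m), R))
        x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
    return np.array(ys)  # (T, m) observation sequence
```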

