Jacobian norm



On the Stability of the Jacobian Matrix in Deep Neural Networks

Dadoun, Benjamin, Hayou, Soufiane, Salam, Hanan, Seddik, Mohamed El Amine, Youssef, Pierre

arXiv.org Artificial Intelligence

Deep neural networks are known to suffer from exploding or vanishing gradients as depth increases, a phenomenon closely tied to the spectral behavior of the input-output Jacobian. Prior work has identified critical initialization schemes that ensure Jacobian stability, but these analyses are typically restricted to fully connected networks with i.i.d. weights. In this work, we go significantly beyond these limitations: we establish a general stability theorem for deep neural networks that accommodates sparsity (such as that introduced by pruning) and non-i.i.d., weakly correlated weights (e.g. induced by training). Our results rely on recent advances in random matrix theory, and provide rigorous guarantees for spectral stability in a much broader class of network models. This extends the theoretical foundation for initialization schemes in modern neural networks with structured and dependent randomness.
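The exploding/vanishing behavior this abstract describes can be illustrated with a minimal numpy sketch. This is not the paper's construction (which covers sparsity and weakly correlated weights): it uses the simplest possible case, a deep *linear* network with i.i.d. weights, where the input-output Jacobian is just the product of the weight matrices.

```python
import numpy as np

def jacobian_spectral_norm(depth, width, sigma_w, seed=0):
    """Top singular value of the input-output Jacobian of a random deep
    linear network J = W_L ... W_1 with i.i.d. N(0, sigma_w^2 / width)
    entries. A simplification for illustration; the paper handles
    nonlinearities, sparsity, and correlated weights."""
    rng = np.random.default_rng(seed)
    J = np.eye(width)
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(width), size=(width, width))
        J = W @ J
    return np.linalg.norm(J, 2)  # spectral norm

# At the critical scale sigma_w = 1 the norm stays moderate with depth;
# slightly above it, the norm blows up exponentially in depth.
stable = jacobian_spectral_norm(depth=50, width=100, sigma_w=1.0)
unstable = jacobian_spectral_norm(depth=50, width=100, sigma_w=1.5)
```

Each layer rescales singular values by roughly sigma_w, so only the critical choice sigma_w = 1 avoids exponential growth or decay of the Jacobian spectrum with depth.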


(Figure caption fragment: "… for fully connected networks trained on MNIST vs. depth")

Neural Information Processing Systems

We thank the reviewers for their detailed and insightful reviews. We answer most of the questions here and will incorporate the feedback into the final version. (Figure 1, right: log leading terms for the spectral bound vs. our bound on a WideResNet trained on CIFAR10 at different depths.) In Figure 1, we address questions about the empirical evaluation of our bounds. The primary challenge is that Theorem 5.1 requires the augmented indicators on the Jacobian norms to be themselves Lipschitz w.r.t. the hidden layers.


Limitations of Normalization in Attention Mechanism

Mudarisov, Timur, Burtsev, Mikhail, Petrova, Tatiana, State, Radu

arXiv.org Artificial Intelligence

This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables identification of the model's selective ability and of the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance the current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
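The two effects named in the abstract, flattening toward uniform selection and gradient saturation at low temperature, can be reproduced with a small numpy sketch of softmax and its Jacobian. This is a generic illustration of the phenomena, not the paper's bounds.

```python
import numpy as np

def softmax(z, temp=1.0):
    z = np.asarray(z, dtype=float) / temp
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(z, temp=1.0):
    """d softmax_i / d z_j = p_i * (delta_ij - p_j); the 1/temp prefactor
    is omitted so only the shape of the Jacobian is compared."""
    p = softmax(z, temp)
    return np.diag(p) - np.outer(p, p)

# Selectivity: as more tokens carry comparable scores, the distribution
# flattens toward uniform and the top token is harder to single out.
few_selected = softmax([3.0, 3.0] + [0.0] * 6)
many_selected = softmax([3.0] * 6 + [0.0] * 2)

# Gradient sensitivity: at low temperature softmax saturates to near
# one-hot, and its Jacobian entries collapse toward zero.
J_low_temp = softmax_jacobian([2.0, 1.0, 0.0], temp=0.05)
J_unit_temp = softmax_jacobian([2.0, 1.0, 0.0], temp=1.0)
```

The Jacobian entries are all products of probabilities, so both extremes hurt: a near-uniform distribution dilutes selectivity, while a near-one-hot one starves upstream gradients.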



gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity

Blayney, Hugh, Arroyo, Álvaro, Dong, Xiaowen, Bronstein, Michael M.

arXiv.org Machine Learning

Graph Neural Networks (GNNs) leverage the graph structure to transmit information between nodes, typically through the message-passing mechanism. While these models have found a wide variety of applications, they are known to suffer from over-squashing, where information from a large receptive field of node representations is collapsed into a single fixed-size vector, resulting in an information bottleneck. In this paper, we re-examine the over-squashing phenomenon through the lens of model storage and retrieval capacity, which we define as the amount of information that can be stored in a node's representation for later use. We study some of the limitations of existing tasks used to measure over-squashing and introduce a new synthetic task to demonstrate that an information bottleneck can saturate this capacity. Furthermore, we adapt ideas from the sequence modeling literature on associative memories, fast weight programmers, and the xLSTM model to develop a novel GNN architecture with improved capacity. We demonstrate strong performance of this architecture both on our synthetic capacity task and on a range of real-world graph benchmarks.
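The associative-memory idea the abstract borrows from the sequence-modeling literature can be sketched in a few lines: a fast-weight store writes key-value pairs into one fixed-size matrix via outer products, retrieval is exact up to the key dimension, and interference appears once that capacity is exceeded. This is a generic linear associative memory for illustration, not the gLSTM architecture itself.

```python
import numpy as np

def store(keys, values):
    """Linear associative memory / fast-weight store: W = sum_i v_i k_i^T."""
    return values.T @ keys               # shape (d_value, d_key)

def retrieve(W, key):
    return W @ key

d = 8                                    # key dimension = storage capacity
keys = np.eye(d)                         # orthonormal keys: no interference
values = np.arange(d * 4, dtype=float).reshape(d, 4)

W = store(keys, values)                  # all pairs squashed into one matrix
recovered = retrieve(W, keys[2])         # exact while #items <= capacity

# Writing one more item whose key overlaps an existing key corrupts
# retrieval: the fixed-size store has run out of orthogonal "slots".
W_over = W + np.outer(np.full(4, 100.0), keys[2])
corrupted = retrieve(W_over, keys[2])
```

Seen this way, over-squashing is a capacity problem: a node's fixed-size representation can only hold so many distinguishable messages before they interfere.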




Towards Unraveling and Improving Generalization in World Models

Fang, Qiaoyi, Du, Weiyu, Wang, Hang, Zhang, Junshan

arXiv.org Artificial Intelligence

World models have recently emerged as a promising approach to reinforcement learning (RL), achieving state-of-the-art performance across a wide range of visual control tasks. This work aims to obtain a deep understanding of the robustness and generalization capabilities of world models. Thus motivated, we develop a stochastic differential equation formulation by treating world model learning as a stochastic dynamical system, and characterize the impact of latent representation errors on robustness and generalization, both for zero-drift and for non-zero-drift representation errors. Our somewhat surprising findings, based on both theoretical and experimental studies, reveal that in the zero-drift case, modest latent representation errors can in fact act as implicit regularization and hence improve robustness. We further propose a Jacobian regularization scheme to mitigate the compounding error-propagation effects of non-zero drift, thereby enhancing training stability and robustness. Our experimental studies corroborate that this regularization approach not only stabilizes training but also accelerates convergence and improves the accuracy of long-horizon prediction.
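The Jacobian regularization idea can be sketched numerically: penalize the Frobenius norm of the transition map's Jacobian so that latent errors are damped rather than amplified across rollout steps. The transition function and constants below are hypothetical stand-ins, not the paper's model; the finite-difference estimator is just one simple way to obtain the penalty without autodiff.

```python
import numpy as np

def dynamics(z, A):
    """Toy latent transition z_{t+1} = tanh(A z_t); a hypothetical
    stand-in for a learned world-model transition."""
    return np.tanh(A @ z)

def jacobian_penalty(z, A, eps=1e-5):
    """Squared Frobenius norm of df/dz at z, via finite differences.
    Added to the training loss with some weight, it discourages
    transitions that amplify latent errors at each rollout step."""
    d = z.size
    f0 = dynamics(z, A)
    J = np.zeros((d, d))
    for i in range(d):
        dz = np.zeros(d)
        dz[i] = eps
        J[:, i] = (dynamics(z + dz, A) - f0) / eps
    return float((J ** 2).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=4)
contractive = jacobian_penalty(z, 0.1 * np.eye(4))   # errors shrink per step
expansive = jacobian_penalty(z, 3.0 * np.eye(4))     # errors compound
```

A transition with Jacobian norm below one contracts perturbations at every step, which is exactly the property that keeps long-horizon rollouts from drifting.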


FedGTST: Boosting Global Transferability of Federated Models via Statistics Tuning

Ma, Evelyn, Pan, Chao, Etesami, Rasoul, Zhao, Han, Milenkovic, Olgica

arXiv.org Artificial Intelligence

The performance of Transfer Learning (TL) heavily relies on effective pretraining, which demands large datasets and substantial computational resources. As a result, executing TL is often challenging for individual model developers. Federated Learning (FL) addresses these issues by facilitating collaborations among clients, expanding the dataset indirectly, distributing computational costs, and preserving privacy. However, key challenges remain unresolved. First, existing FL methods tend to optimize transferability only within local domains, neglecting the global learning domain. Second, most approaches rely on indirect transferability metrics, which do not accurately reflect the final target loss or true degree of transferability. To address these gaps, we propose two enhancements to FL. First, we introduce a client-server exchange protocol that leverages cross-client Jacobian (gradient) norms to boost transferability. Second, we increase the average Jacobian norm across clients at the server, using this as a local regularizer to reduce cross-client Jacobian variance. Our transferable federated algorithm, termed FedGTST (Federated Global Transferability via Statistics Tuning), demonstrates that increasing the average Jacobian and reducing its variance allows for tighter control of the target loss. This leads to an upper bound on the target loss in terms of the source loss and source-target domain discrepancy. Extensive experiments on datasets such as MNIST to MNIST-M and CIFAR10 to SVHN show that FedGTST outperforms relevant baselines, including FedSR. On the second dataset pair, FedGTST improves accuracy by 9.8% over FedSR and 7.6% over FedIIR when LeNet is used as the backbone.
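The exchange protocol described above, clients reporting Jacobian-norm statistics and the server steering their mean and variance, can be reduced to a scalar sketch. Function names and the quadratic penalty form here are hypothetical simplifications of the idea, not FedGTST's actual implementation.

```python
import numpy as np

def server_statistics(client_jacobian_norms):
    """Server side of the exchange: collect each client's scalar Jacobian
    (gradient) norm and return the cross-client mean and variance."""
    norms = np.asarray(client_jacobian_norms, dtype=float)
    return norms.mean(), norms.var()

def local_regularizer(local_norm, broadcast_mean, lam=0.1):
    """Hypothetical local penalty pulling a client's Jacobian norm toward
    the broadcast mean, thereby shrinking cross-client variance."""
    return lam * (local_norm - broadcast_mean) ** 2

# One round: the server aggregates, each client penalizes its deviation.
mean_norm, var_norm = server_statistics([1.2, 0.8, 1.0, 2.0])
```

The client furthest from the mean pays the largest penalty, so successive rounds push the norms together, which is the variance reduction the bound on the target loss relies on.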


On progressive sharpening, flat minima and generalisation

MacDonald, Lachlan Ewen, Valmadre, Jack, Lucey, Simon

arXiv.org Machine Learning

We present a new approach to understanding the relationship between loss curvature and input-output model behaviour in deep learning. Specifically, we use existing empirical analyses of the spectrum of deep network loss Hessians to ground an ansatz tying together the loss Hessian and the input-output Jacobian over training samples during the training of deep neural networks. We then prove a series of theoretical results which quantify the degree to which the input-output Jacobian of a model approximates its Lipschitz norm over a data distribution, and deduce a novel generalisation bound in terms of the empirical Jacobian. We use our ansatz, together with our theoretical results, to give a new account of the recently observed progressive sharpening phenomenon, as well as the generalisation properties of flat minima. Experimental evidence is provided to validate our claims.
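The central quantity here, the empirical input-output Jacobian norm as a data-dependent proxy for the Lipschitz norm, can be computed exactly for a tiny model. The two-layer network and sizes below are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def input_jacobian_norm(x, W1, W2):
    """Spectral norm of the input-output Jacobian of the two-layer model
    f(x) = W2 tanh(W1 x):  J(x) = W2 diag(1 - tanh(W1 x)^2) W1."""
    s = 1.0 - np.tanh(W1 @ x) ** 2       # derivative of tanh
    J = W2 @ (s[:, None] * W1)
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8)) / np.sqrt(8)
W2 = rng.normal(size=(4, 16)) / np.sqrt(16)
xs = rng.normal(size=(32, 8))

# Empirical Jacobian norm over a sample: the data-dependent quantity
# such a generalisation bound can be phrased in terms of ...
empirical = max(input_jacobian_norm(x, W1, W2) for x in xs)
# ... versus the crude global Lipschitz bound ||W2|| * ||W1||
# (valid since |tanh'| <= 1), which the empirical value never exceeds.
global_lipschitz = float(np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2))
```

The gap between the two numbers is the point: bounds stated via the empirical Jacobian over the data distribution can be much tighter than worst-case Lipschitz bounds over all inputs.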