
 layer weight




A Recovery Guarantee for Sparse Neural Networks

arXiv.org Machine Learning

We prove the first guarantees of sparse recovery for ReLU neural networks, where the sparse network weights constitute the signal to be recovered. Specifically, we study structural properties of the sparse network weights for two-layer, scalar-output networks under which a simple iterative hard thresholding algorithm recovers these weights exactly, using memory that grows linearly in the number of nonzero weights. We validate this theoretical result with simple experiments on recovery of sparse planted MLPs, MNIST classification, and implicit neural representations. Experimentally, we find performance that is competitive with, and often exceeds, a high-performing but memory-inefficient baseline based on iterative magnitude pruning.
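As a rough illustration of the iterative hard thresholding idea mentioned in this abstract, the sketch below applies it to a plain sparse linear recovery problem (recover a k-sparse `x` from `y = A x`). This is the classic textbook variant, not the paper's algorithm for ReLU network weights; the function names, step size, and iteration count are assumptions chosen for the example.

```python
def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def iht(A, y, k, step=1.0, iters=50):
    """Iterative hard thresholding for y = A x with a k-sparse x.

    Each iteration takes a gradient step on 0.5 * ||y - A x||^2 and then
    projects back onto the set of k-sparse vectors. The step size is an
    assumption; it must be small enough for the given A.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # residual = y - A x
        residual = [y[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
        # descent direction = A^T residual (negative gradient of the loss)
        direction = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        x = hard_threshold([x[j] + step * direction[j] for j in range(n)], k)
    return x
```

Only the current k-sparse iterate needs to be kept between iterations, which is the intuition behind memory that scales with the number of nonzero weights.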


Depth-Aware Initialization for Stable and Efficient Neural Network Training

arXiv.org Artificial Intelligence

In the past few years, various initialization schemes have been proposed, including Glorot initialization, He initialization, initialization using orthogonal matrices, and the random walk method. Some of these methods emphasize keeping unit variance of activations and gradients as they propagate through the network layers. A few of these methods are independent of depth information, while others take the total network depth into account for better initialization. In this paper, a comprehensive study is presented in which the depth of each layer, as well as the depth of the total network, is incorporated into the initialization scheme. We also find that, for deeper networks, the theoretical assumption of unit variance throughout the network does not perform well; instead, the variance needs to increase from the first layer's activations to the last layer's activations. We propose a novel, flexible way to increase the variance of the network that incorporates the depth of each layer. Experiments show that the proposed method performs better than the existing initialization schemes.
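For reference, the standard deviations used by the Glorot and He schemes named in this abstract can be written down directly. The sketch below also includes a purely hypothetical `depth_scaled_std` helper to illustrate what "variance that grows with layer depth" could look like; its linear gain and the `alpha` parameter are assumptions for illustration and are not the formula proposed in the paper.

```python
import math


def glorot_std(fan_in, fan_out):
    """Glorot (Xavier) normal initialization: Var(w) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """He normal initialization for ReLU layers: Var(w) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def depth_scaled_std(fan_in, layer_index, total_layers, alpha=0.5):
    """HYPOTHETICAL depth-aware rule, for illustration only.

    Scales the He standard deviation by a gain that grows linearly with the
    layer's position (1-based layer_index), so later layers start with
    larger weight variance. Not the paper's formula.
    """
    gain = 1.0 + alpha * layer_index / total_layers
    return gain * he_std(fan_in)
```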


Two-Stage Regularization-Based Structured Pruning for LLMs

arXiv.org Artificial Intelligence

The deployment of large language models (LLMs) is largely hindered by their large number of parameters. Structural pruning has emerged as a promising solution. Prior structured pruning methods directly remove unimportant parameters based on certain metrics, which often causes knowledge loss and necessitates extensive retraining. To overcome this, we introduce a novel pruning method TRSP: Two-Stage Regularization-Based Structured Pruning for LLMs. Specifically, we multiply the output of each transformer layer by an initial learnable weight and iteratively learn these weights by adding their $\ell_1$-norm as a regularization term to the loss function, serving as the first-stage regularization. Subsequently, we apply additional regularization to the difference between the output and input of layers with smaller weights, encouraging the shift of knowledge to the preserved layers. This serves as the second-stage regularization. TRSP retains more knowledge and better preserves model performance than direct parameter elimination. Through extensive experimentation we show that TRSP outperforms strong layer-wise structured pruning methods without requiring retraining. As a layer-wise pruning method, it delivers notable end-to-end acceleration, making it a promising solution for efficient LLM deployment.
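The two regularization terms described in this abstract can be sketched on a toy stack of layers. Everything below is a simplified assumption: layers are plain Python functions on lists, the names `lam` and `mu` are invented coefficients, and the squared-L2 form of the second-stage penalty is a guess at one reasonable choice, since the abstract does not state the exact norm.

```python
def forward(x, layers, layer_weights):
    """Run x through the layers, scaling each layer's output by its learnable weight."""
    h = x
    for layer, w in zip(layers, layer_weights):
        h = [w * v for v in layer(h)]
    return h


def stage_one_penalty(layer_weights, lam):
    """First stage: l1 norm of the per-layer weights, pushing some toward zero."""
    return lam * sum(abs(w) for w in layer_weights)


def stage_two_penalty(layer_inputs, layer_outputs, prune_ids, mu):
    """Second stage (assumed squared-L2 form): penalize the gap between the
    output and input of layers selected for removal, so that dropping them
    changes the hidden state as little as possible."""
    total = 0.0
    for i in prune_ids:
        total += sum((o - a) ** 2 for o, a in zip(layer_outputs[i], layer_inputs[i]))
    return mu * total
```

In a real training loop these penalties would be added to the task loss and minimized with an autodiff framework; the layers with the smallest learned weights would then be the candidates passed as `prune_ids`.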


Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks

arXiv.org Machine Learning

The inductive bias and generalization properties of large machine learning models are -- to a substantial extent -- a byproduct of the optimization algorithm used for training. Among others, the scale of the random initialization, the learning rate, and early stopping all have a crucial impact on the quality of the model learnt by stochastic gradient descent or related algorithms. In order to understand these phenomena, we study the training dynamics of large two-layer neural networks. We use a well-established technique from non-equilibrium statistical physics (dynamical mean field theory) to obtain an asymptotic high-dimensional characterization of these dynamics. This characterization applies to a Gaussian approximation of the hidden neurons' non-linearity, and empirically captures well the behavior of actual neural network models. Our analysis uncovers several interesting new phenomena in the training dynamics: $(i)$ The emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity; $(ii)$ As a consequence, algorithmic inductive bias towards small complexity, but only if the initialization has small enough complexity; $(iii)$ A separation of time scales between feature learning and overfitting; $(iv)$ A non-monotone behavior of the test error and, correspondingly, a 'feature unlearning' phase at large times.


On the Interplay Between Sparsity and Training in Deep Reinforcement Learning

arXiv.org Artificial Intelligence

We study the benefits of different sparse architectures for deep reinforcement learning. In particular, we focus on image-based domains where spatially-biased and fully-connected architectures are common. Using these and several other architectures of equal capacity, we show that sparse structure has a significant effect on learning performance. We also observe that choosing the best sparse architecture for a given domain depends on whether the hidden layer weights are fixed or learned.


Reviews: Towards Understanding the Importance of Shortcut Connections in Residual Networks

Neural Information Processing Systems

The paper investigates the outcome of training a one-hidden-layer convolutional residual network architecture using gradient descent when the input is sampled from a standard Gaussian distribution. As a follow-up to a similar analysis by Du et al. (2017) for CNNs, this paper shows for ResNets that there exist two fixed points of the teacher-student loss function (the network architecture is the same for both). While one is a global minimum, the other is a spurious fixed point. The authors then derive *sufficient* conditions on the parameter initialization and learning rates such that training happens in two phases: (1) a first phase in which the hidden layer weights (w) remain away from the spurious fixed point (due to a sufficiently small learning rate) while the last layer weights (a) approach the optimal value and eventually enter the region where the inner product satisfies a'a* 0; (2) a second phase in which both parameters approach the global minimum, such that the learning rate for w can be larger, allowing faster convergence. I find this paper very interesting, as it provides novel insights into the optimization process of ResNets, even though in a very restricted setting.


RelChaNet: Neural Network Feature Selection using Relative Change Scores

arXiv.org Artificial Intelligence

There is an ongoing effort to develop feature selection algorithms to improve interpretability, reduce computational resources, and minimize overfitting in predictive models. Neural networks stand out as architectures on which to build feature selection methods, and recently, neuron pruning and regrowth have emerged from the sparse neural network literature as promising new tools. We introduce RelChaNet, a novel and lightweight feature selection algorithm that uses neuron pruning and regrowth in the input layer of a dense neural network. For neuron pruning, a gradient sum metric measures the relative change induced in a network after a feature enters, while neurons are randomly regrown. We also propose an extension that adapts the size of the input layer at runtime. Extensive experiments on nine different datasets show that our approach generally outperforms the current state-of-the-art methods, and in particular improves the average accuracy by 2% on the MNIST dataset. Feature selection is an elemental task in predictive modelling. It can serve to reduce computational resources, improve interpretability by highlighting important features, or improve predictive performance by reducing overfitting (Li et al., 2018). Furthering these goals has been the driving motivation behind substantial recent efforts to improve existing feature selection algorithms and develop new ones.


Rethinking the adaptive relationship between Encoder Layers and Decoder Layers

arXiv.org Artificial Intelligence

In the field of machine learning, using pre-trained models to perform specific tasks is a common practice. Typically, this involves fine-tuning the pre-trained model on a specific dataset through iterative adjustments without modifying the model structure. This article focuses on the state-of-the-art (SOTA) machine translation model Helsinki-NLP/opus-mt-de-en, which translates German to English, to explore the adaptive relationship between Encoder Layers and Decoder Layers by introducing a bias-free fully connected layer. Additionally, the study investigates the effects of modifying the pre-trained model structure during fine-tuning. Four experiments were conducted by introducing a bias-free fully connected layer between the Encoder and Decoder Layers: Using original pre-trained model weights and initializing the fully connected layer weights to maintain the original connections, where each Decoder Layer's input is from the 6th Encoder Layer. Through fine-tuning, these weights adapt towards optimal configurations.
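One way to picture the initialization described here is a bias-free mixing matrix with one row per decoder layer and one column per encoder layer, initialized so that every decoder layer reads only from the last (6th) encoder layer. The sketch below is an assumption about the layout, not the article's code: the function names, the default of six layers on each side, and the representation of encoder outputs as plain lists are all illustrative.

```python
def init_mixing_weights(num_decoder_layers=6, num_encoder_layers=6):
    """Bias-free weights W[d][e]: how much decoder layer d reads from encoder layer e.

    Initialized one-hot on the last encoder layer, which reproduces the
    original wiring before any fine-tuning.
    """
    weights = [[0.0] * num_encoder_layers for _ in range(num_decoder_layers)]
    for row in weights:
        row[-1] = 1.0
    return weights


def mix_encoder_outputs(weights, encoder_outputs):
    """Return one input vector per decoder layer as a weighted sum of encoder outputs.

    encoder_outputs holds one vector (list of floats) per encoder layer.
    """
    dim = len(encoder_outputs[0])
    return [
        [sum(row[e] * encoder_outputs[e][i] for e in range(len(encoder_outputs)))
         for i in range(dim)]
        for row in weights
    ]
```

With this starting point the modified model initially behaves like the pre-trained one, and fine-tuning is free to move weight onto earlier encoder layers.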