 forward propagation


Training Transformers with 4-bit Integers

Neural Information Processing Systems

Quantizing activations, weights, and gradients to 4 bits promises to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats that are not supported by contemporary hardware. In this work, we propose a training method for transformers with all matrix multiplications implemented in INT4 arithmetic. Training at ultra-low INT4 precision is challenging. To achieve this, we carefully analyze the specific structures of activations and gradients in transformers to propose dedicated quantizers for them. For forward propagation, we identify the challenge of outliers and propose a Hadamard quantizer to suppress the outliers.
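The idea of suppressing outliers with a Hadamard transform before quantization can be illustrated with a minimal sketch. This is not the paper's quantizer, whose details are not given in the abstract; it assumes a symmetric per-tensor INT4 scheme with codes in [-8, 7] and a Sylvester-type Hadamard matrix, and all function names are hypothetical.

```python
import math

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix of size n (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Block recursion [[H, H], [H, -H]] doubles the size each round.
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def quantize_int4(x):
    """Assumed scheme: symmetric per-tensor quantization to integers in [-8, 7]."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in x]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def hadamard_roundtrip(x):
    """Transform, quantize, dequantize, and transform back."""
    h = hadamard(len(x))
    codes, scale = quantize_int4(matvec(h, x))
    # The scaled Sylvester matrix is symmetric and orthonormal, so it is its own inverse.
    return matvec(h, dequantize(codes, scale)), codes

# One large outlier among small entries.
x = [0.1, -0.2, 0.15, 0.05, 8.0, -0.1, 0.2, -0.15]
direct_codes, _ = quantize_int4(x)
_, hadamard_codes = hadamard_roundtrip(x)
print(direct_codes)
print(hadamard_codes)
```

Quantized directly, the outlier sets the scale and every small entry rounds to zero. After the transform, the outlier's energy is spread over all coordinates, so the largest magnitude drops and every coordinate receives a nonzero code.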


Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning

Neural Information Processing Systems

Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning. Recent works have identified a multitude of beneficial properties in BatchNorm to explain its success. However, given the pursuit of alternative normalization layers, these properties need to be generalized so that any given layer's success/failure can be accurately predicted. In this work, we take a first step towards this goal by extending known properties of BatchNorm in randomly initialized deep neural networks (DNNs) to several recently proposed normalization layers. Our primary findings follow: (i) similar to BatchNorm, activations-based normalization layers can prevent exponential growth of activations in ResNets, but parametric techniques require explicit remedies; (ii) use of GroupNorm can ensure an informative forward propagation, with different samples being assigned dissimilar activations, but increasing group size results in increasingly indistinguishable activations for different samples, explaining slow convergence speed in models with LayerNorm; and (iii) small group sizes result in large gradient norm in earlier layers, hence explaining training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. Overall, our analysis reveals a unified set of mechanisms that underpin the success of normalization methods in deep learning, providing us with a compass to systematically explore the vast design space of DNN normalization layers.
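For orientation, the role of group size can be sketched in code: GroupNorm normalizes each sample over groups of channels, and its two extremes correspond to Instance Normalization (one channel per group) and a LayerNorm-style normalization over all channels (a single group). The following is a minimal sketch for one sample, not the paper's experimental setup; it omits the learnable affine parameters, and the function name and the `eps` default are assumptions.

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Normalize one sample, given as C channels of spatial values, group by group."""
    channels = len(x)
    if channels % num_groups:
        raise ValueError("number of channels must be divisible by num_groups")
    size = channels // num_groups  # channels per group, i.e. the group size
    out = []
    for g in range(num_groups):
        group = x[g * size:(g + 1) * size]
        vals = [v for ch in group for v in ch]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in ch] for ch in group])
    return out

sample = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
instance_like = group_norm(sample, num_groups=2)  # group size 1
layer_like = group_norm(sample, num_groups=1)     # group size = all channels
print(instance_like)
print(layer_like)
```

With group size 1, each channel is standardized on its own, so the two channels above become nearly identical after normalization. With a single group, the statistics are shared and the scale difference between channels is preserved.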


1f9f9d8ff75205aa73ec83e543d8b571-Supplemental.pdf

Neural Information Processing Systems

We repeat the theorems presented in Sec. 3 and provide their proofs below. The theorems hold for Neumann boundary conditions, which we use in our implementation; this is achieved by the construction of the differential operators. The proofs follow the ones presented in [22]. If the activation function $\sigma(\cdot)$ is monotonically non-decreasing and sign-preserving, then the forward propagation through the diffusive PDE in (1) for $t \in [0, \infty)$ yields a non-increasing feature norm, that is, $\partial_t \|f\|^2 \le 0$. Proof. Let us examine the following inner product following Eq.
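The non-increasing norm property can be checked numerically on a toy problem. Eq. (1) is not reproduced in this excerpt, so the sketch below assumes a diffusive update of the form f ← f − h·Gᵀσ(G f) on a 1-D chain of nodes, where G takes differences between neighbouring nodes (which gives Neumann-type boundaries by construction) and σ = tanh, a monotone, sign-preserving activation. The step size h = 0.25 and all names are assumptions made for illustration.

```python
import math

def grad(f):
    """Difference operator G: node values -> edge values on a chain."""
    return [f[i + 1] - f[i] for i in range(len(f) - 1)]

def grad_t(e, n):
    """Transpose of G: edge values -> node values."""
    out = [0.0] * n
    for i, v in enumerate(e):
        out[i] -= v
        out[i + 1] += v
    return out

def diffusion_step(f, h=0.25):
    """One explicit Euler step of the assumed diffusive dynamics."""
    s = [math.tanh(v) for v in grad(f)]
    d = grad_t(s, len(f))
    return [a - h * b for a, b in zip(f, d)]

def sq_norm(f):
    return sum(v * v for v in f)

def run(f, steps, h=0.25):
    """Return the final features and the squared norm after every step."""
    norms = [sq_norm(f)]
    for _ in range(steps):
        f = diffusion_step(f, h)
        norms.append(sq_norm(f))
    return f, norms

f0 = [3.0, -1.0, 0.5, 2.0, -2.5, 0.0]
f_final, norms = run(f0, steps=50)
print(norms[0], norms[-1])
```

For this discretization, the squared norm does not increase from one step to the next when h ≤ 0.5, because u·tanh(u) ≥ tanh(u)² and the largest eigenvalue of GᵀG on a chain is below 4. The sum of the features is also conserved, as expected for diffusion with no flux through the boundary.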





BBoE: Leveraging Bundle of Edges for Kinodynamic Bidirectional Motion Planning

arXiv.org Artificial Intelligence

In this work, we introduce BBoE, a bidirectional, kinodynamic, sampling-based motion planner that consistently and quickly finds low-cost solutions in environments with varying obstacle clutter. The algorithm combines exploration and exploitation while relying on precomputed robot state traversals, resulting in efficient convergence towards the goal. Our key contributions include: i) a strategy to navigate through obstacle-rich spaces by sorting and sequencing preprocessed forward propagations; and ii) BBoE, a robust bidirectional kinodynamic planner that utilizes this strategy to produce fast and feasible solutions. The proposed framework reduces planning time, diminishes solution cost and increases success rate in comparison to previous approaches.

I. INTRODUCTION

Motion planning in robotics involves identifying a series of valid configurations that a robot can assume to transition from an initial state to a desired goal state. Sampling-based planning is a popular graph-based approach used to generate robot motions by sampling discrete states and establishing connections between them via edges [23]. The popularity of these techniques is due to the inherent property of probabilistic completeness, which guarantees that a solution will be found, if one exists, as the number of sampled states approaches infinity [17], [10]. Traditionally, these techniques employ a unidirectional tree that grows from the start state and expands towards the goal region [17], [10], [6].



Supplemental Material: Efficient Neural Network Training via Forward and Backward Propagation Sparsification

Neural Information Processing Systems

This appendix can be divided into four parts. Section A gives the detailed proof of Theorem 1 and discusses the convergence of our method. Before giving the detailed proof, we would like to present the following two properties of overparameterized deep neural networks, which are implied by the latest studies based on the mean field theory. We will empirically verify these properties in this section and adopt them as assumptions in our proof. That's why Property 1 holds.