Tensor Normal Training for Deep Learning Models
Despite the predominant use of first-order methods for training deep learning models, second-order methods, and in particular natural gradient methods, remain of interest because of their potential for accelerating training through the use of curvature information. Several methods with non-diagonal preconditioning matrices, including KFAC, Shampoo, and K-BFGS, have been proposed and shown to be effective. Based on the so-called tensor normal (TN) distribution, we propose and analyze a new approximate natural gradient method, Tensor Normal Training (TNT), which, like Shampoo, only requires knowledge of the shape of the training parameters. By approximating the probabilistically based Fisher matrix, as opposed to the empirical Fisher matrix, our method uses the block-wise covariance of the sampling-based gradient as the preconditioning matrix. Moreover, the assumption that the sampling-based (tensor) gradient follows a TN distribution ensures that its covariance has a Kronecker-separable structure, which leads to a tractable approximation of the Fisher matrix. Consequently, TNT's memory requirements and per-iteration computational costs are only slightly higher than those of first-order methods. In our experiments, TNT exhibited superior optimization performance to state-of-the-art first-order methods, and comparable optimization performance to the state-of-the-art second-order methods KFAC and Shampoo. Moreover, TNT demonstrated its ability to generalize as well as first-order methods, while using fewer epochs.
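To make the Kronecker-separable idea concrete, the following is a minimal sketch (not the paper's actual algorithm) of why such structure makes the natural-gradient step tractable: if a layer's Fisher block is approximated as a Kronecker product A ⊗ B of two small covariance factors, its inverse can be applied to the matrix-shaped gradient G as A⁻¹ G B⁻¹, without ever forming the full Kronecker product. The function name, damping value, and factor sizes here are illustrative assumptions.

```python
import numpy as np

def kronecker_preconditioned_step(G, A, B, damping=1e-3):
    """Apply a Kronecker-factored preconditioner to a matrix gradient G.

    If the Fisher block is approximated as A (x) B (Kronecker product),
    its inverse acts on the row-major vectorized gradient as
    (A (x) B)^{-1} vec(G) = vec(A^{-1} G B^{-1}),
    so only the small factors A and B ever need to be inverted.
    """
    m, n = G.shape
    A_damped = A + damping * np.eye(m)   # damping keeps the solves well-posed
    B_damped = B + damping * np.eye(n)
    return np.linalg.solve(A_damped, G) @ np.linalg.inv(B_damped)

# Sanity check against the explicitly formed Kronecker preconditioner.
rng = np.random.default_rng(0)
m, n = 4, 3
G = rng.normal(size=(m, n))
# Symmetric positive semi-definite factors, as covariance estimates would be.
A = rng.normal(size=(m, m)); A = A @ A.T
B = rng.normal(size=(n, n)); B = B @ B.T

fast = kronecker_preconditioned_step(G, A, B, damping=1e-3)
full = np.linalg.solve(np.kron(A + 1e-3 * np.eye(m), B + 1e-3 * np.eye(n)),
                       G.flatten()).reshape(m, n)
assert np.allclose(fast, full)
```

The memory savings come from storing the m×m and n×n factors instead of the mn×mn Fisher block, which is what keeps the per-iteration cost close to that of first-order methods.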
A Some Tensor Definitions and Properties
We present in this section fairly standard notation and definitions regarding tensors (see, e.g., Chapter 3 of [30]) that we use throughout the paper. Note that when A is a matrix, this corresponds to the row-major vectorization of A. Lemma 3 is proved by induction, assuming that (6) holds for 1, 2, ..., k − 1 and then establishing it for k. The proof of Theorem 1 follows from Theorem 2.8 in [44]: Algorithm 2 itself ensures AS.4, and hence, by Theorem 2.8 of [44], the result is guaranteed. In Algorithm 3, we present detailed pseudo-code for our actual implementation of TNT.
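Since the appendix fixes the row-major vectorization convention, the following is a quick numerical check (an illustration added here, not part of the paper) of the standard identity that underlies Kronecker-separable structure under that convention: vec_r(A X Bᵀ) = (A ⊗ B) vec_r(X), where vec_r stacks the rows of a matrix, which is NumPy's default C-order flattening.

```python
import numpy as np

# Numerical check of the row-major vectorization identity
#   vec_r(A X B^T) = (A ⊗ B) vec_r(X),
# where vec_r stacks the rows of a matrix (NumPy's default C order).
rng = np.random.default_rng(1)
A = rng.normal(size=(2, 3))
X = rng.normal(size=(3, 4))
B = rng.normal(size=(5, 4))

lhs = (A @ X @ B.T).flatten()     # row-major vectorization of A X B^T
rhs = np.kron(A, B) @ X.flatten()
assert np.allclose(lhs, rhs)
```

Note that under the more common column-major convention the roles of the factors swap, giving vec(A X B) = (Bᵀ ⊗ A) vec(X); the row-major convention is what makes A ⊗ B (in that order) the natural object here.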
Transformer in Transformer Supplemental Material
We can see that for both DeiT-S and TNT-S, more patches become related as the layer goes deeper. An MLP is used to calculate the attention values, and the attention is multiplied with all the embeddings. We extract features from different layers of TNT to construct multi-scale features. The COCO2017 val results are shown in Table 2; TNT achieves much better results. Table 2: Results of Faster R-CNN object detection on COCO minival set with ImageNet pre-training.
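The attention-weighted pooling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact module: the MLP width, activation, and the choice to concatenate pooled features from several layers are all hypothetical here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(embeddings, w1, b1, w2, b2):
    """Pool patch embeddings using MLP-computed attention values.

    embeddings: (num_patches, dim). A small MLP scores each patch,
    the scores are normalized with softmax, and the attention-weighted
    sum over all embeddings gives one pooled feature vector.
    """
    hidden = np.tanh(embeddings @ w1 + b1)   # (num_patches, hidden_dim)
    scores = (hidden @ w2 + b2).squeeze(-1)  # (num_patches,)
    attn = softmax(scores)                   # attention values
    return attn @ embeddings                 # (dim,)

rng = np.random.default_rng(2)
dim, hidden_dim, num_patches = 8, 16, 10
w1 = rng.normal(size=(dim, hidden_dim)); b1 = np.zeros(hidden_dim)
w2 = rng.normal(size=(hidden_dim, 1));   b2 = np.zeros(1)

# Multi-scale features: pool the embeddings produced at several layers
# and concatenate the results.
layer_outputs = [rng.normal(size=(num_patches, dim)) for _ in range(3)]
multi_scale = np.concatenate([attention_pool(e, w1, b1, w2, b2)
                              for e in layer_outputs])
assert multi_scale.shape == (3 * dim,)
```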
TNT: Improving Chunkwise Training for Test-Time Memorization
Zeman Li, Ali Behrouz, Yuan Deng, Peilin Zhong, Praneeth Kacham, Mahdi Karami, Meisam Razaviyayn, Vahab Mirrokni
Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly scaling paradigm distinct from Transformers. While these expressive models do not yet match the peak performance of state-of-the-art Transformers, their potential has been largely untapped due to prohibitively slow training and low hardware utilization. Existing parallelization methods force a fundamental conflict governed by the chunksize hyperparameter: large chunks boost speed but degrade performance, necessitating a fixed, suboptimal compromise. To solve this challenge, we introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. Stage one is an efficiency-focused pre-training phase utilizing a hierarchical memory. A global module processes large, hardware-friendly chunks for long-range context, while multiple parallel local modules handle fine-grained details. Crucially, by periodically resetting local memory states, we break sequential dependencies to enable massive context parallelization. Stage two is a brief fine-tuning phase where only the local memory modules are adapted to a smaller, high-resolution chunksize, maximizing accuracy with minimal overhead. Evaluated on Titans and TTT models, TNT achieves a substantial acceleration in training speed, up to 17 times faster than the most accurate baseline configuration, while simultaneously improving model accuracy. This improvement removes a critical scalability barrier, establishing a practical foundation for developing expressive RNNs and facilitating future work to close the performance gap with Transformers.
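The key parallelization idea, resetting local memory at chunk boundaries, can be sketched with a toy recurrence. This is not the TNT architecture itself: a simple linear associative memory (S ← S + k vᵀ) stands in for a deep test-time memorization module, and all names and sizes are illustrative. The point is structural: because the local state is reset at each chunk boundary, each chunk's computation depends only on its own tokens, so the per-chunk map can run in parallel across chunks (with long-range context left to a separate global module).

```python
import numpy as np

def local_memory_outputs(chunk_kv):
    """Process one chunk with a fresh (reset) local memory state.

    chunk_kv is a list of (key, value) vector pairs. The state S starts
    at zero for every chunk, so this function depends only on its own
    chunk and the map over chunks is embarrassingly parallel.
    """
    dim = chunk_kv[0][0].shape[0]
    S = np.zeros((dim, dim))
    outputs = []
    for k, v in chunk_kv:
        S = S + np.outer(k, v)    # write the (key, value) association
        outputs.append(S.T @ k)   # read the memory with the current key
    return outputs

rng = np.random.default_rng(3)
dim, chunk_size, num_chunks = 4, 5, 3
seq = [(rng.normal(size=dim), rng.normal(size=dim))
       for _ in range(chunk_size * num_chunks)]
chunks = [seq[i * chunk_size:(i + 1) * chunk_size]
          for i in range(num_chunks)]

# Each chunk is independent of the others, so this map could be
# dispatched to separate devices instead of run sequentially.
per_chunk = [local_memory_outputs(c) for c in chunks]
assert len(per_chunk) == num_chunks and len(per_chunk[0]) == chunk_size
```

Without the reset, S would carry over between chunks and every chunk would have to wait for its predecessor, which is exactly the sequential dependency the abstract says TNT breaks.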