
Appendix A and Generalization

Neural Information Processing Systems

The directional derivative of the loss function is closely related to the eigenspectrum of the mNTKs. ... For deep models, as mentioned in (Hoffer et al., 2017), the weight distance from its initialization ... Combining Lemma 2 and Eq. 18, we can see that as training iterations increase, the model's Rademacher complexity also grows as its weights deviate further from their initializations, which ... We generally follow the settings of Liu et al. (2019) to train BERT ... All VGG baselines are initialized with Kaiming initialization (He et al., 2015) and are trained with SGD for ... Network pruning (Frankle & Carbin, 2018; Sanh et al., 2020; Liu et al., 2021) applies various criteria ... MAT is the first work to employ the principal eigenvalue of the mNTK as the module selection criterion. Table 5 compares the extended MAT, the vanilla BERT model, and SNIP (Lee et al., 2018b) in terms of ... In our implementation, we apply SNIP in a modular manner by calculating the connection sensitivity of each module. In contrast, using the criteria of MAT, we prune 50% of the attention heads while training the remaining ones with MAT; this approach leads to a further 56.7% acceleration of computation. ... Following Turc et al. (2019), we apply the proposed MAT to BERT models with different network scales, namely ...
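
The modular SNIP baseline mentioned above can be illustrated with a minimal sketch, assuming a PyTorch setup; the `module_snip_scores` helper, the `model.attention_heads` mapping, and the 50% keep ratio are illustrative assumptions, not the authors' implementation. It scores each module by the SNIP connection sensitivity |g * w| aggregated over that module's parameters.

```python
# Hypothetical sketch of module-level SNIP scoring (not the paper's code).
import torch

def module_snip_scores(model, modules, loss_fn, batch):
    """SNIP connection sensitivity |g * w|, aggregated per module, on one mini-batch."""
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    scores = {}
    for name, module in modules.items():
        params = [p for p in module.parameters() if p.requires_grad]
        grads = torch.autograd.grad(loss, params,
                                    retain_graph=True, allow_unused=True)
        scores[name] = sum(
            (g * p).abs().sum().item()
            for p, g in zip(params, grads) if g is not None
        )
    return scores

# Example: keep the half of the attention heads with the highest sensitivity,
# assuming `model.attention_heads` is a dict {head_name: head_module}.
# scores = module_snip_scores(model, model.attention_heads,
#                             torch.nn.functional.cross_entropy, batch)
# kept = sorted(scores, key=scores.get, reverse=True)[: len(scores) // 2]
```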



Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models

Neural Information Processing Systems

Despite their prevalence in deep-learning communities, over-parameterized models demand substantial computational costs for proper training. This work studies the fine-grained, modular-level learning dynamics of over-parameterized models to attain a more efficient and effective training strategy. Empirical evidence reveals that when scaling down into network modules, such as heads in self-attention models, we can observe varying learning patterns implicitly associated with each module's trainability. To describe such modular-level learning capabilities, we introduce a novel concept dubbed modular neural tangent kernel (mNTK), and we demonstrate that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $\lambda_{\max}$. A large $\lambda_{\max}$ indicates that the module learns features with better convergence, while a small one may harm generalization. Inspired by this discovery, we propose a novel training strategy termed Modular Adaptive Training (MAT) that selectively updates only those modules whose $\lambda_{\max}$ exceeds a dynamic threshold, concentrating the model on learning common features and ignoring inconsistent ones. Unlike most existing training schemes, which run a complete backpropagation cycle across all network modules, MAT significantly reduces computation through its partial-updating strategy while further improving performance. Experiments show that MAT nearly halves the computational cost of model training and outperforms the baselines in accuracy.
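
As a rough illustration of the procedure described in this abstract, the following is a minimal sketch assuming a PyTorch setup: each module's mNTK is approximated by the Gram matrix of per-example gradients of a scalar network output with respect to that module's parameters, its largest eigenvalue is computed, and only modules exceeding a dynamic threshold are updated. The mean-based threshold rule and the helper names below are assumptions for illustration, not the authors' exact method.

```python
# Hypothetical sketch of MAT-style selective updating (not the authors' code).
import torch

def mntk_lambda_max(module, f_outputs):
    """Largest eigenvalue of the modular NTK, estimated on a small batch.

    f_outputs: 1-D tensor of per-example scalar outputs f(x_i), kept in the
    autograd graph. The mNTK here is the Gram matrix of per-example gradients
    of f with respect to this module's parameters.
    """
    params = [p for p in module.parameters() if p.requires_grad]
    rows = []
    for i in range(f_outputs.shape[0]):
        grads = torch.autograd.grad(f_outputs[i], params,
                                    retain_graph=True, allow_unused=True)
        rows.append(torch.cat([
            (g if g is not None else torch.zeros_like(p)).reshape(-1)
            for p, g in zip(params, grads)
        ]))
    jac = torch.stack(rows)          # (batch, n_module_params)
    gram = jac @ jac.T               # modular NTK restricted to this batch
    return torch.linalg.eigvalsh(gram)[-1].item()

def select_active_modules(lambdas, threshold):
    """Keep only modules whose principal eigenvalue exceeds the threshold."""
    return {name for name, lam in lambdas.items() if lam > threshold}

# Sketch of one selective-update step over a dict `modules = {name: nn.Module}`:
# lambdas = {name: mntk_lambda_max(m, f_outputs) for name, m in modules.items()}
# threshold = sum(lambdas.values()) / len(lambdas)   # placeholder dynamic threshold
# active = select_active_modules(lambdas, threshold)
# for name, m in modules.items():
#     for p in m.parameters():
#         p.requires_grad_(name in active)   # skip BP for inactive modules
#         # (restore requires_grad before the next scoring pass)
```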




Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models

Shi, Yubin, Chen, Yixuan, Dong, Mingzhi, Yang, Xiaochen, Li, Dongsheng, Wang, Yujiang, Dick, Robert P., Lv, Qin, Zhao, Yingying, Yang, Fan, Lu, Tun, Gu, Ning, Shang, Li

arXiv.org Artificial Intelligence

Despite their prevalence in deep-learning communities, over-parameterized models demand substantial computational costs for proper training. This work studies the fine-grained, modular-level learning dynamics of over-parameterized models to attain a more efficient and effective training strategy. Empirical evidence reveals that when scaling down into network modules, such as heads in self-attention models, we can observe varying learning patterns implicitly associated with each module's trainability. To describe such modular-level learning capabilities, we introduce a novel concept dubbed modular neural tangent kernel (mNTK), and we demonstrate that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $\lambda_{\max}$. A large $\lambda_{\max}$ indicates that the module learns features with better convergence, while a small one may harm generalization. Inspired by this discovery, we propose a novel training strategy termed Modular Adaptive Training (MAT) that selectively updates only those modules whose $\lambda_{\max}$ exceeds a dynamic threshold, concentrating the model on learning common features and ignoring inconsistent ones. Unlike most existing training schemes, which run a complete backpropagation cycle across all network modules, MAT significantly reduces computation through its partial-updating strategy while further improving performance. Experiments show that MAT nearly halves the computational cost of model training and outperforms the baselines in accuracy.