Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models

Neural Information Processing Systems 

Despite their prevalence in the deep-learning community, over-parameterized models demand substantial computational cost to train properly. This work studies the fine-grained, module-level learning dynamics of over-parameterized models in order to derive a more efficient training strategy. Empirical evidence reveals that, when we scale down to network modules such as the heads of a self-attention model, each module exhibits a distinct learning pattern that is implicitly tied to its trainability. To characterize these module-level learning capabilities, we introduce a novel concept dubbed the modular neural tangent kernel (mNTK), and we demonstrate that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue \lambda_{\max}. A large \lambda_{\max} indicates that the module learns features with good convergence, whereas a small one may harm generalization.
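To make the mNTK concrete, the sketch below (an assumption on our part, not the authors' released code) estimates \lambda_{\max} for a chosen module in PyTorch: it stacks the Jacobians of the network outputs with respect to only that module's parameters over a batch, forms the Gram matrix K_m = J_m J_m^T, and returns its largest eigenvalue. The function and model names are illustrative; an efficient implementation would replace the per-output loop with vectorized Jacobian routines.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: principal eigenvalue of a module-wise NTK.
# The mNTK of a module is K_m = J_m J_m^T, where J_m stacks the Jacobians
# of the network outputs w.r.t. that module's parameters over a batch.

def module_ntk_lambda_max(model, module, inputs):
    """Return the largest eigenvalue of the mNTK restricted to `module`."""
    params = [p for p in module.parameters() if p.requires_grad]
    outputs = model(inputs).reshape(-1)           # flatten batch outputs

    rows = []
    for out in outputs:                           # one Jacobian row per output
        grads = torch.autograd.grad(out, params, retain_graph=True,
                                    allow_unused=True)
        flat = torch.cat([
            (g if g is not None else torch.zeros_like(p)).reshape(-1)
            for g, p in zip(grads, params)
        ])
        rows.append(flat)

    J = torch.stack(rows)                         # (num_outputs, num_params)
    K = J @ J.T                                   # modular NTK (Gram matrix)
    return torch.linalg.eigvalsh(K)[-1].item()    # principal eigenvalue


if __name__ == "__main__":
    # Toy example: a two-layer MLP, treating each linear layer as a "module".
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
    x = torch.randn(32, 8)
    for name, layer in [("layer0", model[0]), ("layer2", model[2])]:
        print(name, module_ntk_lambda_max(model, layer, x))
```

Comparing the resulting \lambda_{\max} values across modules (e.g., across attention heads) is what lets one tell well-converging modules from poorly trainable ones, as the abstract describes.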