AITopics | deep network training

deep network training

GrokAlign: Geometric Characterisation and Acceleration of Grokking

Walker, Thomas, Humayun, Ahmed Imtiaz, Balestriero, Randall, Baraniuk, Richard

arXiv.org Machine Learning, Aug-1-2025

A key challenge for the machine learning community is to understand and accelerate the training dynamics of deep networks that lead to delayed generalisation and emergent robustness to input perturbations, also known as grokking. Prior work has associated phenomena like delayed generalisation with the transition of a deep network from a linear to a feature learning regime, and emergent robustness with changes to the network's functional geometry, in particular the arrangement of the so-called linear regions in deep networks employing continuous piecewise affine nonlinearities. Here, we explain how grokking is realised in the Jacobian of a deep network and demonstrate that aligning a network's Jacobians with the training data (in the sense of cosine similarity) ensures grokking under a low-rank Jacobian assumption. Our results provide a strong theoretical motivation for the use of Jacobian regularisation in optimizing deep networks -- a method we introduce as GrokAlign -- which we show empirically to induce grokking much sooner than more conventional regularizers like weight decay. Moreover, we introduce centroid alignment as a tractable and interpretable simplification of Jacobian alignment that effectively identifies and tracks the stages of deep network training dynamics. Accompanying webpage (https://thomaswalker1.github.io/blog/grokalign.html) and code (https://github.com/ThomasWalker1/grokalign).
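As a rough illustration of what "aligning a network's Jacobians with the training data (in the sense of cosine similarity)" could measure, the sketch below computes the input Jacobian of a tiny one-hidden-layer ReLU network and the cosine similarity of each Jacobian row with the input. This is not the authors' GrokAlign regulariser, whose exact form the abstract does not give; the function names and toy weights are assumptions made for this example only.

```python
import math

def relu_mlp_jacobian(W1, b1, W2, x):
    """Jacobian of f(x) = W2 @ relu(W1 @ x + b1) with respect to x.

    A ReLU network is affine on each linear region, so at x the Jacobian
    is W2 @ diag(mask) @ W1, where mask marks the active hidden units.
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    mask = [1.0 if p > 0 else 0.0 for p in pre]
    hidden, n_in = len(W1), len(x)
    return [
        [sum(W2[k][j] * mask[j] * W1[j][i] for j in range(hidden)) for i in range(n_in)]
        for k in range(len(W2))
    ]

def cosine(u, v):
    """Cosine similarity of two vectors; defined as 0.0 if either is zero."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def jacobian_alignment(W1, b1, W2, x):
    """Cosine similarity between each Jacobian row and the input x."""
    return [cosine(row, x) for row in relu_mlp_jacobian(W1, b1, W2, x)]

# Toy weights, hypothetical values chosen only for illustration.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 0.0], [1.0, -1.0, 1.0]]
x = [2.0, 2.0]
alignment = jacobian_alignment(W1, b1, W2, x)
```

In this toy case the first Jacobian row is parallel to `x` and the second is orthogonal to it. A regulariser in this spirit would presumably reward higher alignment values during training, but that step is left out here because the source does not specify it.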

artificial intelligence, deep learning, machine learning

arXiv:2506.12284

Country: North America > Canada > Ontario > Toronto (0.28)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

TensorFlow DTensor: Unified API for Distributed Deep Network Training

#artificialintelligence, May-29-2022, 10:24:45 GMT

The recently released TensorFlow v2.9 introduces a new API for model-, data-, and space-parallel (aka spatially tiled) deep network training. DTensor aims to decouple sharding directives from the model code by providing higher-level utilities to partition the model and batch parameters between devices. The work is part of the recent effort (e.g. GPipe, TF Mesh, GShard, DeepSpeed, Fairscale, ColossalAI) to decrease the development time needed to build large-scale training workloads. For large (language) models, test loss scales logarithmically with the number of network parameters, data size, and compute time.
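The idea of keeping sharding directives separate from model code can be sketched without any framework. The plain-Python example below is not the DTensor API: `shard`, the `(axis_name, devices)` mesh tuple, and the layout tuples are hypothetical constructs for illustration. The "model" is an ordinary `matmul`; only the layout passed to `shard` decides how the batch is split.

```python
def shard(matrix, layout, mesh):
    """Split a 2-D list across a 1-D device mesh.

    layout has one entry per matrix axis: the mesh axis name to shard that
    axis over, or None to replicate along it. Returns {device: local_block}.
    """
    axis_name, devices = mesh
    n = len(devices)
    if layout[0] == axis_name and layout[1] == axis_name:
        raise ValueError("a mesh axis can shard at most one matrix axis")
    shards = {}
    for rank, dev in enumerate(devices):
        block = matrix
        if layout[0] == axis_name:
            if len(matrix) % n:
                raise ValueError("row count must divide evenly across devices")
            step = len(matrix) // n
            block = matrix[rank * step:(rank + 1) * step]
        if layout[1] == axis_name:
            if len(matrix[0]) % n:
                raise ValueError("column count must divide evenly across devices")
            step = len(matrix[0]) // n
            block = [row[rank * step:(rank + 1) * step] for row in block]
        shards[dev] = block
    return shards

def matmul(a, b):
    """Model code: knows nothing about devices or layouts."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

mesh = ("batch", ["dev0", "dev1"])           # hypothetical two-device mesh
x = [[1, 2], [3, 4], [5, 6], [7, 8]]         # batch of four examples
w = [[1, 0], [0, 2]]                         # model parameters

x_local = shard(x, ("batch", None), mesh)    # data parallel: split the batch
w_local = shard(w, (None, None), mesh)       # replicate the weights

# Each device runs the same unmodified model code on its local blocks.
y_local = {dev: matmul(x_local[dev], w_local[dev]) for dev in mesh[1]}
y = [row for dev in mesh[1] for row in y_local[dev]]
```

Switching to a different partitioning in this toy would mean changing only the layout tuples, not `matmul`, which is the decoupling the article attributes to DTensor.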

api, deep network training, tensorflow dtensor

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language (0.58)

The Two Regimes of Deep Network Training

Leclerc, Guillaume, Madry, Aleksander

arXiv.org Machine Learning, Feb-24-2020

The learning rate schedule has a major impact on the performance of deep learning models. Still, the choice of a schedule is often heuristic. We aim to develop a precise understanding of the effects of different learning rate schedules and the appropriate way to select them. To this end, we isolate two distinct phases of training: the first, which we refer to as the "large-step" regime, exhibits rather poor performance from an optimization point of view but is the primary contributor to model generalization; the latter, "small-step" regime exhibits much more "convex-like" optimization behavior but, used in isolation, produces models that generalize poorly. We find that by treating these regimes separately, and specializing our training algorithm to each one of them, we can significantly simplify learning rate schedules.
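A schedule that treats the two regimes separately could be as simple as a two-level, piecewise-constant step size. The sketch below is only an assumption about what such a simplified schedule might look like; the abstract does not give the authors' algorithm, and the function name, learning rates, and switch point are hypothetical.

```python
def two_regime_lr(step, switch_step, large_lr=0.1, small_lr=0.001):
    """Piecewise-constant learning rate with one switch.

    Steps before switch_step use large_lr (the "large-step" regime);
    steps from switch_step onward use small_lr (the "small-step" regime).
    """
    if step < 0 or switch_step < 0:
        raise ValueError("step and switch_step must be non-negative")
    return large_lr if step < switch_step else small_lr

# Hypothetical run: 10 steps in total, switching regime after step 6.
schedule = [two_regime_lr(s, switch_step=7) for s in range(10)]
```

Here `schedule` holds seven large steps followed by three small ones. In practice the switch point and both rates would need tuning for the task at hand.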

deep network training, momentum, regime

arXiv:2002.10376

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)

Variance-Preserving Initialization Schemes Improve Deep Network Training: But Which Variance is Preserved?

Luther, Kyle, Seung, H. Sebastian

arXiv.org Machine Learning, Feb-13-2019

Before training a neural net, a classic rule of thumb is to randomly initialize the weights so that the variance of the preactivation is preserved across all layers. This is traditionally interpreted using the total variance due to randomness in both networks (weights) and samples. Alternatively, one can interpret the rule of thumb as preservation of the sample mean and variance for a fixed network, i.e., preactivation statistics computed over the random sample of training samples. The two interpretations differ little for a shallow net, but the difference is shown to be large for a deep ReLU net by decomposing the total variance into the network-averaged sum of the sample variance and square of the sample mean. We demonstrate that the latter term dominates in the later layers through an analytical calculation in the limit of infinite network width, and numerical simulations for finite width. Our experimental results from training neural nets support the idea that preserving sample statistics can be better than preserving total variance. We discuss the implications for the alternative rule of thumb that a network should be initialized to be at the "edge of chaos."
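The decomposition mentioned above can be checked numerically for a single fixed network: for each unit, the second moment of the preactivation over samples equals the sample variance plus the squared sample mean. The sketch below is an assumption-laden illustration, not the authors' experiment: it uses He-style Gaussian weights, Gaussian inputs, no biases, and small hypothetical sizes.

```python
import math
import random

def layer_statistics(depth=6, width=32, n_samples=16, n_inputs=8, seed=0):
    """Per-layer preactivation statistics for one fixed random ReLU net.

    Returns a list of (sample_variance, squared_sample_mean, second_moment)
    tuples, one per layer, each averaged over the units of that layer.
    Statistics are taken over the samples, with the weights held fixed.
    """
    rng = random.Random(seed)
    acts = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)] for _ in range(n_samples)]
    stats = []
    fan_in = n_inputs
    for _ in range(depth):
        std = math.sqrt(2.0 / fan_in)  # He-style scaling (assumption of this sketch)
        W = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(width)]
        # pre[s][j] is the preactivation of unit j for sample s.
        pre = [[sum(w * a for w, a in zip(row, sample)) for row in W] for sample in acts]
        var_sum = sq_mean_sum = second_sum = 0.0
        for j in range(width):
            z = [p[j] for p in pre]
            m = sum(z) / n_samples
            var_sum += sum((v - m) ** 2 for v in z) / n_samples
            sq_mean_sum += m * m
            second_sum += sum(v * v for v in z) / n_samples
        stats.append((var_sum / width, sq_mean_sum / width, second_sum / width))
        acts = [[max(0.0, v) for v in p] for p in pre]  # ReLU
        fan_in = width
    return stats

stats = layer_statistics()
```

Averaging these tuples over many seeds approximates the network average in the decomposition. Comparing the first two entries of each tuple layer by layer shows how the second moment splits between sample variance and squared sample mean at a given depth.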

initialization, sample mean, variance

arXiv:1902.04942

Country:

North America > United States (0.04)
Europe > Italy > Sardinia (0.04)

Genre: Research Report (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
