AITopics | deep residual network

Neural ordinary differential equations (neural ODEs) are a popular family of continuous-depth deep learning models. In this work, we consider a large family of parameterized ODEs with continuous-in-time parameters, which include time-dependent neural ODEs. We derive a generalization bound for this class by a Lipschitz-based argument. By leveraging the analogy between neural ODEs and deep residual networks, our approach yields in particular a generalization bound for a class of deep residual networks. The bound involves the magnitude of the difference between successive weight matrices. We illustrate numerically how this quantity affects the generalization capability of neural networks.
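As a rough illustration of the quantity named in the abstract above, the snippet below measures the difference between successive weight matrices of a residual network. The function name, the use of the Frobenius norm, and the toy weights are assumptions made here; the abstract does not say which norm or scaling the bound uses.

```python
import numpy as np

def successive_weight_differences(weights):
    """Frobenius norms ||W[k+1] - W[k]|| for consecutive layers.

    `weights` is a hypothetical list of equally shaped 2-D arrays, one per
    residual layer. The norm is an illustrative choice, not the paper's.
    """
    return [float(np.linalg.norm(nxt - cur)) for cur, nxt in zip(weights, weights[1:])]

# Toy example: the first two layers differ, the last two are identical.
weights = [np.zeros((2, 2)), np.eye(2), np.eye(2)]
diffs = successive_weight_differences(weights)
print(diffs)       # one value per pair of neighbouring layers
print(sum(diffs))  # a single summary number (summing is also an assumption)
```

Weights that vary slowly from one layer to the next give small values here, which is the regime in which a residual network most resembles a discretized ODE with continuous-in-time parameters.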

deep residual network, neural ordinary differential equation, ordinary differential equation, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks

Neural Information Processing Systems | Dec-24-2025, 19:53:36 GMT

Batch normalization dramatically increases the largest trainable depth of residual networks, and this benefit has been crucial to the empirical success of deep residual networks on a wide range of benchmarks. We show that this key benefit arises because, at initialization, batch normalization downscales the residual branch relative to the skip connection, by a normalizing factor on the order of the square root of the network depth. This ensures that, early in training, the function computed by normalized residual blocks in deep networks is close to the identity function (on average). We use this insight to develop a simple initialization scheme that can train deep residual networks without normalization. We also provide a detailed empirical study of residual networks, which clarifies that, although batch normalized networks can be trained with larger learning rates, this effect is only beneficial in specific compute regimes, and has minimal benefits when the batch size is small.
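The downscaling described above can be sketched with a toy simulation. It assumes a fully connected residual branch, ReLU activations, He-style random weights, and a bare-bones batch normalization without learned scale or shift; none of these details come from the abstract, and the function name is hypothetical. Each normalized branch adds roughly unit variance to the skip path, so after `l` blocks the skip path has variance of about `l + 1` and the branch is smaller by a factor of about `sqrt(l + 1)`.

```python
import numpy as np

def simulate_skip_variance(depth=16, width=256, batch=1024, seed=0):
    """Track the mean per-feature variance on the skip path of a toy
    residual network whose branches are batch-normalized at initialization."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    variances = [float(x.var(axis=0).mean())]
    for _ in range(depth):
        # Random branch weights (He-style scaling, an illustrative choice).
        w = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        branch = np.maximum(x, 0.0) @ w
        # Minimal batch normalization: zero mean, unit variance per feature.
        branch = (branch - branch.mean(axis=0)) / (branch.std(axis=0) + 1e-5)
        x = x + branch  # skip connection
        variances.append(float(x.var(axis=0).mean()))
    return variances

variances = simulate_skip_variance()
# Relative scale of a unit-variance branch against the final skip path.
relative_scale = 1.0 / np.sqrt(variances[-1])
print(variances[0], variances[-1], relative_scale)
```

This is only a sketch of the mechanism, not a reproduction of the paper's experiments.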

batch normalization bias residual block, identity function, residual network, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Swapout: Learning an ensemble of deep architectures

Saurabh Singh, Derek Hoiem, David Forsyth

Neural Information Processing Systems | Nov-21-2025, 09:40:20 GMT

When viewed as an ensemble training method, Swapout samples a much richer set of architectures than existing methods such as dropout or stochastic depth.

artificial intelligence, deep learning, machine learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois > Champaign County > Urbana (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)

Generalization bounds for neural ordinary differential equations and deep residual networks

Neural Information Processing Systems | Oct-9-2025, 02:20:05 GMT

Neural ordinary differential equations (neural ODEs, Chen et al., 2018) are a flexible family of

artificial intelligence, machine learning, neural network, (14 more...)

Neural Information Processing Systems

Country:

Europe > France > Île-de-France > Paris > Paris (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

On residual network depth

Benoit Dherin, Michael Munn

arXiv.org Machine Learning | Oct-7-2025

Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that these residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of this implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal. This insight offers a first-principles explanation for the historical dependence on normalization layers in training deep models and sheds new light on a family of successful normalization-free techniques like SkipInit and Fixup. However, while these previous approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work offers the first explanation derived directly from the network's inherent functional structure. Specifically, our Residual Expansion Theorem reveals that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity control that also implicitly regularizes the model's complexity.
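The ensemble view can be checked by hand in one special case. The sketch below assumes purely linear residual modules `x -> x + A x`, which is a simplification made here and not the setting of the paper's Residual Expansion Theorem; all function names are hypothetical. For such modules, the composed network equals a sum over every subset of modules (every "path"), so a network with `n` modules contains `2**n` paths.

```python
import itertools
import math
import numpy as np

def residual_product(mats):
    """Compose linear residual modules: (I + A_{n-1}) ... (I + A_0)."""
    d = mats[0].shape[0]
    out = np.eye(d)
    for a in mats:
        out = (np.eye(d) + a) @ out
    return out

def path_expansion(mats):
    """Sum over all subsets of modules, i.e. over all computation paths."""
    d, n = mats[0].shape[0], len(mats)
    total = np.zeros((d, d))
    for k in range(n + 1):
        for subset in itertools.combinations(range(n), k):
            term = np.eye(d)
            for idx in subset:  # increasing index: earlier modules act first
                term = mats[idx] @ term
            total += term
    return total

rng = np.random.default_rng(0)
mats = [rng.standard_normal((3, 3)) for _ in range(5)]
same = np.allclose(residual_product(mats), path_expansion(mats))
num_paths = sum(math.comb(len(mats), k) for k in range(len(mats) + 1))
print(same, num_paths)
```

In this linear toy case, multiplying every module by a factor `lam` weights a path through `k` modules by `lam**k`, so the total path weight is `(1 + lam)**n` instead of `2**n`, which hints at why per-module scaling can tame the growth.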

complexity, ensemble, residual network, (14 more...)

arXiv.org Machine Learning

2510.0347

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks

Spencer Frei, Yuan Cao, Quanquan Gu

Neural Information Processing Systems | Oct-2-2025, 23:07:20 GMT

Compared with its rapid and widespread adoption, the theoretical understanding of why deep learning works so well has lagged significantly. This is particularly the case in the common setup of an overparameterized network, where the number of parameters in the network greatly exceeds the number of training examples and input dimension.

artificial intelligence, machine learning, neural network, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.29)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks

Neural Information Processing Systems | Aug-17-2025, 01:39:55 GMT

This paper provides a simple explanation for why batch normalized deep residual networks are easily trainable.

batch normalization, batch size, normalization, (14 more...)

Neural Information Processing Systems

Country: North America > Canada (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Generalization bounds for neural ordinary differential equations and deep residual networks

Neural Information Processing Systems | Jan-19-2025, 16:23:27 GMT

Neural ordinary differential equations (neural ODEs) are a popular family of continuous-depth deep learning models. In this work, we consider a large family of parameterized ODEs with continuous-in-time parameters, which include time-dependent neural ODEs. We derive a generalization bound for this class by a Lipschitz-based argument. By leveraging the analogy between neural ODEs and deep residual networks, our approach yields in particular a generalization bound for a class of deep residual networks. The bound involves the magnitude of the difference between successive weight matrices.

deep residual network, neural ordinary differential equation, ordinary differential equation, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
