AITopics | small initialization

From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics

Neural Information Processing Systems | Jun-16-2026, 13:18:22 GMT

Although transformer-based models have shown exceptional empirical performance, the fundamental principles governing their training dynamics are inadequately characterized beyond configuration-specific studies. Inspired by empirical evidence showing improved reasoning capabilities under small initialization scales in language models, we employ the gradient flow analytical framework established in Zhou et al. [2022] to systematically investigate linearized Transformer training dynamics.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: Asia > China (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Preconditioning Matters: Fast Global Convergence of Non-convex Matrix Factorization via Scaled Gradient Descent

Neural Information Processing Systems | Apr-30-2026, 06:24:27 GMT

Low-rank matrix factorization (LRMF) is a canonical problem in non-convex optimization. The objective function to be minimized is non-convex and even non-smooth, which makes global convergence guarantees for gradient-based algorithms quite challenging.

artificial intelligence, initialization, machine learning, (16 more...)

Neural Information Processing Systems

Country: Asia (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.45)

21c426323068204f4199c490d730e88e-Paper-Conference.pdf

Neural Information Processing Systems | Apr-25-2026, 20:32:03 GMT

artificial intelligence, machine learning, probability, (16 more...)

Neural Information Processing Systems

Genre: Research Report (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Towards Understanding the Condensation of Neural Networks at Initial Training

Neural Information Processing Systems | Apr-24-2026, 14:16:15 GMT

Empirical work shows that for ReLU neural networks (NNs) with small initialization, the input weights of hidden neurons (the input weight of a hidden neuron consists of the weight from its input layer to the hidden neuron and its bias term) condense onto isolated orientations. The condensation dynamics imply that training implicitly regularizes an NN towards one with a much smaller effective size. In this work, we illustrate the formation of condensation in multi-layer fully connected NNs and show that the maximal number of condensed orientations in the initial training stage is twice the multiplicity of the activation function, where "multiplicity" refers to the multiplicity of the root of the activation function at the origin. Our theoretical analysis confirms experiments for two cases: the first is an activation function of multiplicity one with input of arbitrary dimension, which covers many common activation functions, and the second is a layer with one-dimensional input and arbitrary multiplicity. This work makes a step towards understanding how small initialization leads NNs to condensation at the initial training stage.
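
As a rough illustration of what "condensing onto isolated orientations" means in practice, the sketch below counts how many distinct directions a set of input weight vectors occupies, using cosine similarity. It is not the authors' procedure: the function name `count_condensed_orientations`, the threshold `cos_tol`, and the toy weight matrix are all assumptions made for this example.

```python
import numpy as np

def count_condensed_orientations(weights, cos_tol=0.99):
    """Greedily group the rows of `weights` by direction; return the number of groups.

    `cos_tol` is a hypothetical threshold: two rows count as the same
    orientation when their cosine similarity is at least `cos_tol`.
    Opposite directions (cosine near -1) count as different orientations.
    """
    w = np.asarray(weights, dtype=float)
    norms = np.linalg.norm(w, axis=1)
    keep = norms > 0                      # drop zero vectors, which have no direction
    units = w[keep] / norms[keep][:, None]
    representatives = []
    for u in units:
        # Start a new group only if u is not aligned with any existing representative.
        if not any(float(u @ r) >= cos_tol for r in representatives):
            representatives.append(u)
    return len(representatives)

# Toy input weights (assumed data): three neurons point along (1, 1),
# one points along (-1, -1).
weights = np.array([
    [1.0, 1.0],
    [2.0, 2.0],
    [-0.5, -0.5],
    [3.0, 3.001],
])
print(count_condensed_orientations(weights))  # prints 2
```

In this toy case the four neurons occupy two opposite orientations, which matches the bound stated above for an activation function of multiplicity one (at most twice the multiplicity).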

artificial intelligence, deep learning, machine learning, (17 more...)

Neural Information Processing Systems

Country: Asia > China (0.16)

Genre: Research Report (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

f02f1185b97518ab5bd7ebde466992d3-Paper-Conference.pdf

Neural Information Processing Systems | Feb-17-2026, 21:40:35 GMT

artificial intelligence, initialization, machine learning, (16 more...)

Neural Information Processing Systems

Country:

Asia > Macao (0.14)
Asia > Middle East > Jordan (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.33)

5195825ee60d7efc1e42b7f3f3137040-Paper-Conference.pdf

Neural Information Processing Systems | Feb-13-2026, 04:26:41 GMT

initialization, invariant manifold, matrix, (13 more...)

Neural Information Processing Systems

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

6c351da15b5e8a743a21ee96a86e25df-Paper.pdf

Neural Information Processing Systems | Feb-9-2026, 05:58:33 GMT

classifier, linear classifier, neural network, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization

Neural Information Processing Systems | Feb-8-2026, 21:30:30 GMT

Gradient Descent (GD) is a simple yet efficient baseline algorithm for solving nonconvex optimization problems.

artificial intelligence, machine learning, nullx, (17 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

19c145aaad40927c51f4d10eaa339c20-Paper-Conference.pdf

Neural Information Processing Systems | Feb-8-2026, 17:24:40 GMT

Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate. In this work, we investigate the mechanisms of how transformers behave on unseen compositional tasks.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre:

Research Report > New Finding (0.46)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)

Preconditioning Matters: Fast Global Convergence of Non-convex Matrix Factorization via Scaled Gradient Descent

Neural Information Processing Systems | Dec-27-2025, 04:37:32 GMT

Low-rank matrix factorization (LRMF) is a canonical problem in non-convex optimization. The objective function to be minimized is non-convex and even non-smooth, which makes global convergence guarantees for gradient-based algorithms quite challenging.

fast global convergence, non-convex matrix factorization, varepsilon, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.95)
