AITopics

Country: North America > United States > New York (0.28)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)

Ioannis Panageas, Georgios Piliouras, Xiao Wang

First-order methods almost always avoid saddle points: The case of vanishing step-sizes

Neural Information Processing SystemsFeb-12-2026, 00:11:36 GMT

Neural Information Processing Systems http://nips.cc/

converge, saddle point, theorem, (13 more...)

Country:

Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)
North America > Canada > Quebec > Montreal (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.32)

Neural Information Processing SystemsFeb-7-2026, 13:24:57 GMT

118bd558033a1016fcc82560c65cca5f-Supplemental.pdf

classifier, dataset, training environment, (17 more...)

Country:

Africa (0.14)
Oceania (0.04)
Europe > Netherlands (0.04)
Asia (0.04)

Genre: Research Report (0.93)

Industry:

Health & Medicine (0.47)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Constantinos Daskalakis, Ioannis Panageas

The Limit Points of (Optimistic) Gradient Descent in Min-Max Optimization

Neural Information Processing SystemsNov-20-2025, 14:32:28 GMT

When they converge, do they converge to local min-max solutions? We characterize the limit points of two basic first order methods, namely Gradient Descent/Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA).

artificial intelligence, critical point, machine learning, (16 more...)

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.83)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Kvalheim, Matthew D., Sontag, Eduardo D.

Autoencoding Dynamics: Topological Limitations and Capabilities

arXiv.org Artificial IntelligenceNov-12-2025

Given a "data manifold" $M\subset \mathbb{R}^n$ and "latent space" $\mathbb{R}^\ell$, an autoencoder is a pair of continuous maps consisting of an "encoder" $E\colon \mathbb{R}^n\to \mathbb{R}^\ell$ and "decoder" $D\colon \mathbb{R}^\ell\to \mathbb{R}^n$ such that the "round trip" map $D\circ E$ is as close as possible to the identity map $\mbox{id}_M$ on $M$. We present various topological limitations and capabilites inherent to the search for an autoencoder, and describe capabilities for autoencoding dynamical systems having $M$ as an invariant manifold.

artificial intelligence, deep learning, machine learning, (19 more...)

2511.04807

Country: North America > United States > Maryland (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Karagodin, Nikita, Ge, Shu, Polyanskiy, Yury, Rigollet, Philippe

Normalization in Attention Dynamics

arXiv.org Artificial IntelligenceNov-12-2025

We study the effect of normalization schemes on token representations in deep transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts as a form of speed regulation. This perspective enables a unified analysis of several schemes -- including Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT -- revealing how they influence clustering dynamics and representation collapse. Our framework clarifies how different schemes shape token representations across layers and provides a principled basis for comparing them, identifying Peri-LN as a particularly effective choice.

large language model, machine learning, normalization, (20 more...)

2510.22026

Country: Europe (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Crăciun, Alexandru, Ghoshdastidar, Debarghya

Non-Singularity of the Gradient Descent map for Neural Networks with Piecewise Analytic Activations

arXiv.org Artificial IntelligenceOct-29-2025

The theory of training deep networks has become a central question of modern machine learning and has inspired many practical advancements. In particular, the gradient descent (GD) optimization algorithm has been extensively studied in recent years. A key assumption about GD has appeared in several recent works: the \emph{GD map is non-singular} -- it preserves sets of measure zero under preimages. Crucially, this assumption has been used to prove that GD avoids saddle points and maxima, and to establish the existence of a computable quantity that determines the convergence to global minima (both for GD and stochastic GD). However, the current literature either assumes the non-singularity of the GD map or imposes restrictive assumptions, such as Lipschitz smoothness of the loss (for example, Lipschitzness does not hold for deep ReLU networks with the cross-entropy loss) and restricts the analysis to GD with small step-sizes. In this paper, we investigate the neural network map as a function on the space of weights and biases. We also prove, for the first time, the non-singularity of the gradient descent (GD) map on the loss landscape of realistic neural network architectures (with fully connected, convolutional, or softmax attention layers) and piecewise analytic activations (which includes sigmoid, ReLU, leaky ReLU, etc.) for almost all step-sizes. Our work significantly extends the existing results on the convergence of GD and SGD by guaranteeing that they apply to practical neural network settings and has the potential to unlock further exploration of learning dynamics.

artificial intelligence, machine learning, measure zero, (19 more...)

2510.24466

Country: North America > United States (0.30)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Nikolaou, Giorgos, Mencattini, Tommaso, Crisostomi, Donato, Santilli, Andrea, Panagakis, Yannis, Rodolà, Emanuele

Language Models are Injective and Hence Invertible

arXiv.org Artificial IntelligenceOct-22-2025

Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.

large language model, machine learning, natural language, (21 more...)