Collaborating Authors

 Daneshmand, Hadi


Provable optimal transport with transformers: The essence of depth and prompt engineering

arXiv.org Machine Learning

Can we establish provable performance guarantees for transformers? Establishing such theoretical guarantees is a milestone in developing trustworthy generative AI. In this paper, we take a step toward addressing this question by focusing on optimal transport, a fundamental problem at the intersection of combinatorial and continuous optimization. Leveraging the computational power of attention layers, we prove that a transformer with fixed parameters can effectively solve the optimal transport problem in Wasserstein-2 with entropic regularization for an arbitrary number of points. Consequently, the transformer can sort lists of arbitrary sizes up to an approximation factor. Our results rely on an engineered prompt that enables the transformer to implement gradient descent with adaptive stepsizes on the dual optimal transport objective. Combining the convergence analysis of gradient descent with Sinkhorn dynamics, we establish an explicit approximation bound for optimal transport with transformers, which improves as depth increases. Our findings provide novel insights into the essence of prompt engineering and depth for solving optimal transport. In particular, prompt engineering boosts the algorithmic expressivity of transformers, allowing them to implement an optimization method. With increasing depth, transformers can simulate several iterations of gradient descent.
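
The dual/Sinkhorn machinery the abstract refers to can be made concrete in a few lines of NumPy. The sketch below solves entropic Wasserstein-2 optimal transport between two point clouds by plain Sinkhorn iterations (alternating updates of the dual scalings) and uses it to sort a list, mirroring the sorting corollary above. It is a standard baseline for the same problem, not the transformer construction or the adaptive-stepsize dual gradient descent analyzed in the paper; the regularization strength and iteration count are illustrative.

```python
import numpy as np

def sinkhorn(x, y, eps=0.05, iters=200):
    """Entropic OT between uniform measures on point clouds x and y,
    solved by Sinkhorn iterations (block updates of the dual scalings)."""
    n, m = len(x), len(y)
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean (W2) cost
    K = np.exp(-C / eps)                                  # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                    # transport plan

# Sorting as transport: map points on the line onto their sorted copy.
x = np.random.default_rng(0).standard_normal((6, 1))
P = sinkhorn(x, np.sort(x, axis=0))
print(np.round(P * len(x), 2))   # approximately a permutation matrix encoding the sort
```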


Transformers Learn Temporal Difference Methods for In-Context Reinforcement Learning

arXiv.org Artificial Intelligence

In-context learning refers to the learning ability of a model during inference time without adapting its parameters. The input (i.e., prompt) to the model (e.g., transformers) consists of both a context (i.e., instance-label pairs) and a query instance. The model is then able to output a label for the query instance according to the context during inference. A possible explanation for in-context learning is that the forward pass of (linear) transformers implements iterations of gradient descent on the instance-label pairs in the context. In this paper, we prove by construction that transformers can also implement temporal difference (TD) learning in the forward pass, a phenomenon we refer to as in-context TD. We demonstrate the emergence of in-context TD after training the transformer with a multi-task TD algorithm, accompanied by theoretical analysis. Furthermore, we prove that transformers are expressive enough to implement many other policy evaluation algorithms in the forward pass, including residual gradient, TD with eligibility trace, and average-reward TD.
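
For orientation, the sketch below is the textbook TD(0) update with linear function approximation, i.e., the policy-evaluation rule the abstract says transformers can emulate in their forward pass. It runs on a toy two-state chain with an illustrative step size and discount; it is not the transformer construction or the multi-task TD training setup from the paper.

```python
import numpy as np

def td0_linear(features, rewards, next_features, alpha=0.1, gamma=0.9, sweeps=2000):
    """TD(0) with a linear value function v(s) = w @ phi(s), swept repeatedly
    over a batch of (phi(s), r, phi(s')) transitions."""
    w = np.zeros(features.shape[1])
    for _ in range(sweeps):
        for phi, r, phi_next in zip(features, rewards, next_features):
            td_error = r + gamma * phi_next @ w - phi @ w
            w += alpha * td_error * phi            # semi-gradient update
    return w

# Toy chain: state 0 -> state 1 with reward 0, state 1 -> state 1 with reward 1.
phi      = np.array([[1.0, 0.0], [0.0, 1.0]])
phi_next = np.array([[0.0, 1.0], [0.0, 1.0]])
r        = np.array([0.0, 1.0])
print(td0_linear(phi, r, phi_next))   # approaches the true values [9, 10] for gamma = 0.9
```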


On the impact of activation and normalization in obtaining isometric embeddings at initialization

arXiv.org Machine Learning

In this paper, we explore the structure of the penultimate Gram matrix in deep neural networks, which contains the pairwise inner products of outputs corresponding to a batch of inputs. In several architectures it has been observed that this Gram matrix becomes degenerate with depth at initialization, which dramatically slows training. Normalization layers, such as batch or layer normalization, play a pivotal role in preventing this rank collapse. Despite promising advances, the existing theoretical results do not extend to layer normalization, which is widely used in transformers, and cannot quantitatively characterize the role of non-linear activations. To bridge this gap, we prove that layer normalization, in conjunction with activation layers, biases the Gram matrix of a multilayer perceptron towards the identity matrix at an exponential rate with depth at initialization. We quantify this rate using the Hermite expansion of the activation function.
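
A quick way to see the claimed effect numerically is to track how far the batch's cosine-similarity (Gram) matrix is from the identity as depth grows in a randomly initialized MLP with a nonlinearity and layer normalization. The NumPy sketch below does that; the width, depth, ReLU activation, and layer ordering (linear, then activation, then layer normalization) are illustrative choices rather than the paper's exact setting.

```python
import numpy as np

def layer_norm(h):
    # Standardize each sample's representation to zero mean and unit variance.
    return (h - h.mean(1, keepdims=True)) / h.std(1, keepdims=True)

def gram_deviation(batch, width=512, depth=50, seed=0):
    """Frobenius distance between the batch's cosine-similarity matrix and the
    identity after each (linear -> ReLU -> LayerNorm) layer at initialization."""
    rng = np.random.default_rng(seed)
    h, devs = batch, []
    for _ in range(depth):
        W = rng.standard_normal((h.shape[1], width)) / np.sqrt(h.shape[1])
        h = layer_norm(np.maximum(h @ W, 0.0))
        norms = np.linalg.norm(h, axis=1)
        G = (h @ h.T) / np.outer(norms, norms)
        devs.append(np.linalg.norm(G - np.eye(len(batch))))
    return devs

rng = np.random.default_rng(1)
x = 0.7 * rng.standard_normal(512) + rng.standard_normal((8, 512))   # correlated batch
print([round(d, 3) for d in gram_deviation(x)[::10]])
# Deviation from the identity shrinks with depth, down to a finite-width floor.
```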


Transformers learn to implement preconditioned gradient descent for in-context learning

arXiv.org Artificial Intelligence

Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent. Going beyond the question of expressivity, we ask: Can transformers learn to implement such algorithms by training over random problem instances? To our knowledge, we make the first theoretical progress on this question via an analysis of the loss landscape for linear transformers trained over random instances of linear regression. For a single attention layer, we prove the global minimum of the training objective implements a single iteration of preconditioned gradient descent. Notably, the preconditioning matrix not only adapts to the input distribution but also to the variance induced by data inadequacy. For a transformer with $L$ attention layers, we prove certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent. Our results call for future theoretical studies on learning algorithms by training transformers.
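
To make "one iteration of preconditioned gradient descent" concrete for in-context linear regression: starting from w = 0, take a single gradient step preconditioned by a matrix P and predict on the query. The sketch below uses the inverse empirical covariance as an illustrative preconditioner (which turns the single step into ordinary least squares); the preconditioner realized by a trained linear transformer is characterized in the paper and generally differs, notably by accounting for noise in the context.

```python
import numpy as np

def preconditioned_gd_prediction(X, y, x_query, P):
    """One preconditioned GD step on L(w) = ||X w - y||^2 / (2 n) from w = 0,
    followed by a prediction for the query instance."""
    grad = -X.T @ y / len(y)      # gradient at w = 0
    w = -P @ grad                 # w_1 = w_0 - P grad, with w_0 = 0
    return x_query @ w

rng = np.random.default_rng(0)
d, n = 5, 20
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)
x_q = rng.standard_normal(d)

P = np.linalg.inv(X.T @ X / n)    # illustrative choice of preconditioner
print(preconditioned_gd_prediction(X, y, x_q, P), x_q @ w_star)
```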


Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion

arXiv.org Artificial Intelligence

Normalization layers are one of the key building blocks for deep neural networks. Several theoretical studies have shown that batch normalization improves signal propagation by preventing the representations from becoming collinear across layers. However, results on the mean-field theory of batch normalization also conclude that this benefit comes at the expense of gradients that explode with depth. Motivated by these two aspects of batch normalization, in this study we pose the following question: "Can a batch-normalized network keep the optimal signal propagation properties, but avoid exploding gradients?" We answer this question in the affirmative by giving a particular construction of a Multi-Layer Perceptron (MLP) with linear activations and batch normalization that provably has bounded gradients at any depth. Based on Weingarten calculus, we develop a rigorous and non-asymptotic theory for this constructed MLP that gives a precise characterization of forward signal propagation, while proving that gradients remain bounded for linearly independent input samples, which holds in most practical settings. Inspired by our theory, we also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
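
A simple way to probe the gradient-explosion question empirically is to backpropagate through a deep linear MLP with batch normalization at initialization and record the per-layer gradient norms. The PyTorch sketch below does this for a plain Gaussian-initialized network; it is a diagnostic only, not the paper's bounded-gradient construction or its activation-shaping scheme, and the depth, width, and loss are arbitrary choices.

```python
import torch
import torch.nn as nn

def grad_norms_by_depth(depth=50, width=64, batch=32, seed=0):
    """Forward a random batch through a deep linear + BatchNorm MLP, backprop a
    simple loss, and return the gradient norm of each linear layer's weights."""
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width, bias=False), nn.BatchNorm1d(width)]
    net = nn.Sequential(*layers)
    loss = net(torch.randn(batch, width)).pow(2).mean()
    loss.backward()
    return [m.weight.grad.norm().item() for m in net if isinstance(m, nn.Linear)]

norms = grad_norms_by_depth()
print([round(g, 2) for g in norms[::10]])   # gradient norms from shallow to deep layers
```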


On Bridging the Gap between Mean Field and Finite Width in Deep Random Neural Networks with Batch Normalization

arXiv.org Artificial Intelligence

Mean-field theory is widely used in theoretical studies of neural networks. In this paper, we analyze the role of depth in the concentration of mean-field predictions, specifically for deep multilayer perceptrons (MLPs) with batch normalization (BN) at initialization. Mean-field predictions are obtained by letting the network width tend to infinity; at finite width, they incur layer-wise errors that are postulated to amplify with depth. We demonstrate that BN stabilizes the distribution of representations, which prevents this error propagation in mean-field predictions. This stabilization, characterized by a geometric mixing property, allows us to establish concentration bounds for mean-field predictions in infinitely deep neural networks with finite width.
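
One way to visualize the finite-width gap is to fix a large depth, vary the width, and measure how far a batch statistic sits from its infinite-width prediction. In the NumPy sketch below the statistic is the mean absolute off-diagonal entry of the batch Gram matrix after many linear + batch-normalization layers; its mean-field prediction is zero, following the orthogonalization result for BN networks discussed elsewhere in this list. Batch size, depth, and the chosen widths are illustrative, and the experiment is a rough probe rather than the paper's analysis.

```python
import numpy as np

def batchnorm(h):
    # BN at initialization: standardize each feature across the batch.
    return (h - h.mean(0)) / (h.std(0) + 1e-8)

def offdiag_gram(width, depth=100, batch=8, seed=0):
    """Mean absolute off-diagonal Gram entry after `depth` random linear + BN layers."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    for _ in range(depth):
        h = batchnorm(h @ (rng.standard_normal((width, width)) / np.sqrt(width)))
    G = (h @ h.T) / width
    return np.abs(G - np.diag(np.diag(G))).mean()

for w in (32, 128, 512):
    print(w, round(offdiag_gram(w), 4))   # deviation from the mean-field limit shrinks with width
```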


Polynomial-time Sparse Measure Recovery: From Mean Field Theory to Algorithm Design

arXiv.org Artificial Intelligence

Mean field theory has provided theoretical insights into various algorithms by letting the problem size tend to infinity. We argue that the applications of mean-field theory go beyond theoretical insights as it can inspire the design of practical algorithms. Leveraging mean-field analyses in physics, we propose a novel algorithm for sparse measure recovery. For sparse measures over $\mathbb{R}$, we propose a polynomial-time recovery method from Fourier moments that improves upon convex relaxation methods in a specific parameter regime; then, we demonstrate the application of our results for the optimization of particular two-dimensional, single-layer neural networks in realizable settings.
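
The paper's recovery method comes out of mean-field analyses; purely as a point of reference, the sketch below shows the classical Prony-style (method-of-moments) route to recovering a sparse atomic measure from a handful of Fourier moments, with atoms placed on [0, 2*pi) for simplicity. It is a standard baseline included for orientation, not the algorithm proposed in the paper, and the example measure is illustrative.

```python
import numpy as np

def prony_recover(moments, s):
    """Recover s atoms (locations and weights) from the Fourier moments
    m_k = sum_j w_j exp(i k x_j), k = 0, ..., 2s-1, via an annihilating filter."""
    m = np.asarray(moments, dtype=complex)
    # Solve sum_{l=0}^{s-1} c_l m_{k+l} = -m_{k+s} for the filter coefficients c.
    A = np.array([m[k:k + s] for k in range(s)])
    c = np.linalg.solve(A, -m[s:2 * s])
    roots = np.roots(np.concatenate(([1.0 + 0j], c[::-1])))   # roots are exp(i x_j)
    locations = np.mod(np.angle(roots), 2 * np.pi)
    # Weights from the Vandermonde system m_k = sum_j w_j roots_j^k.
    V = np.vander(roots, N=2 * s, increasing=True).T
    weights = np.linalg.lstsq(V, m, rcond=None)[0].real
    order = np.argsort(locations)
    return locations[order], weights[order]

x_true, w_true = np.array([0.5, 2.0, 4.0]), np.array([0.2, 0.5, 0.3])
m = np.array([(w_true * np.exp(1j * k * x_true)).sum() for k in range(6)])
print(prony_recover(m, 3))   # recovers the three atoms up to numerical error
```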


Efficient displacement convex optimization with particle gradient descent

arXiv.org Artificial Intelligence

Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures. This paper considers particle gradient descent with a finite number of particles and establishes its theoretical guarantees for optimizing functions that are \emph{displacement convex} in measures. Concretely, for Lipschitz displacement convex functions defined on probability measures over $\mathbb{R}^d$, we prove that $O(1/\epsilon^2)$ particles and $O(d/\epsilon^4)$ computations are sufficient to find $\epsilon$-optimal solutions. We further provide improved complexity bounds for optimizing smooth displacement convex functions. We demonstrate the application of our results for function approximation with specific neural architectures that take two-dimensional inputs.
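
A minimal sketch of particle gradient descent, assuming a simple displacement convex objective: the measure is represented by n equally weighted particles, and all particles take gradient steps in parallel along the Wasserstein gradient of the functional. The functional below (a convex potential plus a convex interaction term) and the step size are illustrative; the paper's complexity bounds concern general Lipschitz and smooth displacement convex functions.

```python
import numpy as np

def particle_gradient_descent(x0, step=0.05, iters=500, lam=1.0):
    """Particle GD on F(mu) = E_mu[V] + (lam/2) E_{mu x mu}[W(x - y)] with
    V(x) = ||x||^4 / 4 and W(z) = ||z||^2 / 2; the unique minimizer is a Dirac at 0."""
    x = x0.copy()
    for _ in range(iters):
        grad_V = (x ** 2).sum(1, keepdims=True) * x     # gradient of ||x||^4 / 4
        grad_W = lam * (x - x.mean(0))                  # interaction pulls toward the mean
        x -= step * (grad_V + grad_W)                   # parallel update of all particles
    return x

particles = np.random.default_rng(0).standard_normal((200, 2))
out = particle_gradient_descent(particles)
print(round(np.abs(out).max(), 3))   # particles contract toward the origin
```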


Batch Normalization Orthogonalizes Representations in Deep Random Networks

arXiv.org Machine Learning

This paper underlines a subtle property of batch normalization (BN): successive batch normalizations with random linear transformations make hidden representations increasingly orthogonal across layers of a deep neural network. We establish a non-asymptotic characterization of the interplay between depth, width, and the orthogonality of deep representations. More precisely, under a mild assumption, we prove that the deviation of the representations from orthogonality rapidly decays with depth up to a term inversely proportional to the network width. This result has two main implications: 1) Theoretically, as the depth grows, the distribution of the representation -- after the linear layers -- contracts to a Wasserstein-2 ball around an isotropic Gaussian distribution. Furthermore, the radius of this Wasserstein ball shrinks with the width of the network. 2) In practice, the orthogonality of the representations directly influences the performance of stochastic gradient descent (SGD). When representations are initially aligned, we observe that SGD wastes many iterations orthogonalizing the representations before classification. Nevertheless, we experimentally show that starting optimization from orthogonal representations is sufficient to accelerate SGD, with no need for BN.
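
The effect is easy to reproduce numerically: push a small batch through successive random linear layers followed by batch normalization and track the Frobenius distance between the batch's cosine-similarity (Gram) matrix and the identity. In the sketch below the inputs start out correlated; batch size, width, and depth are illustrative choices.

```python
import numpy as np

def bn(h):
    # BN at initialization: standardize each feature across the batch.
    return (h - h.mean(0)) / (h.std(0) + 1e-8)

def orthogonality_gap(batch=4, width=256, depth=60, seed=0):
    """|| G - I ||_F for the batch's cosine-similarity matrix G after each
    random linear + BN layer, starting from a correlated batch."""
    rng = np.random.default_rng(seed)
    h = 0.7 * rng.standard_normal(width) + rng.standard_normal((batch, width))
    gaps = []
    for _ in range(depth):
        h = bn(h @ (rng.standard_normal((width, width)) / np.sqrt(width)))
        norms = np.linalg.norm(h, axis=1)
        G = (h @ h.T) / np.outer(norms, norms)
        gaps.append(np.linalg.norm(G - np.eye(batch)))
    return gaps

print([round(g, 3) for g in orthogonality_gap()[::10]])   # gap decreases with depth
```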


Towards a Theoretical Understanding of Batch Normalization

arXiv.org Machine Learning

Normalization techniques such as Batch Normalization have been applied very successfully for training deep neural networks. Yet, despite its apparent empirical benefits, the reasons behind the success of Batch Normalization are mostly hypothetical. We thus aim to provide a more thorough theoretical understanding from an optimization perspective. Our main contribution towards this goal is the identification of various problem instances in the realm of machine learning where, under certain assumptions, Batch Normalization can provably accelerate optimization with gradient-based methods. We thereby turn Batch Normalization from an effective practical heuristic into a provably converging algorithm for these settings. Furthermore, we substantiate our analysis with empirical evidence that suggests the validity of our theoretical results in a broader context.
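
One mechanism emphasized in this line of work is the length-direction decoupling that normalization induces on the optimization geometry: the weight is reparametrized as w = g * v / ||v|| and gradient descent is run on (g, v) separately. The sketch below applies this reparametrization to an ordinary least-squares problem with an ill-conditioned design; the step sizes and data are illustrative, and it is a toy illustration of the decoupling rather than a reproduction of the paper's settings or rates. A natural companion experiment is to run plain gradient descent on w with the same budget and compare the final losses.

```python
import numpy as np

def decoupled_gd_least_squares(X, y, steps=500, lr=0.05):
    """GD on the least-squares loss with the reparametrization w = g * v / ||v||,
    updating the length g and the direction v separately."""
    v, g = np.ones(X.shape[1]), 1.0
    for _ in range(steps):
        w = g * v / np.linalg.norm(v)
        grad_w = X.T @ (X @ w - y) / len(y)
        # Chain rule through w = g * v / ||v||.
        grad_g = grad_w @ v / np.linalg.norm(v)
        grad_v = g / np.linalg.norm(v) * (grad_w - (grad_w @ v) * v / (v @ v))
        g -= lr * grad_g
        v -= lr * grad_v
    w = g * v / np.linalg.norm(v)
    return 0.5 * np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10)) @ np.diag(np.linspace(0.1, 3.0, 10))   # ill-conditioned design
y = X @ rng.standard_normal(10)
print(decoupled_gd_least_squares(X, y))   # final least-squares loss after 500 decoupled steps
```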