Goto

Collaborating Authors

 Gradient Descent


Matrix Low-Rank Approximation For Policy Gradient Methods

arXiv.org Artificial Intelligence

Estimating a policy that maps states to actions is a central problem in reinforcement learning. Traditionally, policies are inferred from the so called value functions (VFs), but exact VF computation suffers from the curse of dimensionality. Policy gradient (PG) methods bypass this by learning directly a parametric stochastic policy. Typically, the parameters of the policy are estimated using neural networks (NNs) tuned via stochastic gradient descent. However, finding adequate NN architectures can be challenging, and convergence issues are common as well. In this paper, we put forth low-rank matrix-based models to estimate efficiently the parameters of PG algorithms. We collect the parameters of the stochastic policy into a matrix, and then, we leverage matrix-completion techniques to promote (enforce) low rank. We demonstrate via numerical studies how low-rank matrix-based policy models reduce the computational and sample complexities relative to NN models, while achieving a similar aggregated reward.


Efficient Ensembles Improve Training Data Attribution

arXiv.org Artificial Intelligence

Training data attribution (TDA) methods aim to quantify the influence of individual training data points on the model predictions, with broad applications in data-centric AI, such as mislabel detection, data selection, and copyright compensation. However, existing methods in this field, which can be categorized as retraining-based and gradient-based, have struggled with the trade-off between computational efficiency and attribution efficacy. Retraining-based methods can accurately attribute complex non-convex models but are computationally prohibitive, while gradient-based methods are efficient but often fail for non-convex models. Recent research has shown that augmenting gradient-based methods with ensembles of multiple independently trained models can achieve significantly better attribution efficacy. However, this approach remains impractical for very large-scale applications. In this work, we discover that expensive, fully independent training is unnecessary for ensembling the gradient-based methods, and we propose two efficient ensemble strategies, DROPOUT ENSEMBLE and LORA ENSEMBLE, alternative to naive independent ensemble. These strategies significantly reduce training time (up to 80%), serving time (up to 60%), and space cost (up to 80%) while maintaining similar attribution efficacy to the naive independent ensemble. Our extensive experimental results demonstrate that the proposed strategies are effective across multiple TDA methods on diverse datasets and models, including generative settings, significantly advancing the Pareto frontier of TDA methods with better computational efficiency and attribution efficacy.


Model-Agnostic Zeroth-Order Policy Optimization for Meta-Learning of Ergodic Linear Quadratic Regulators

arXiv.org Artificial Intelligence

Meta-learning has been proposed as a promising machine learning topic in recent years, with important applications to image classification, robotics, computer games, and control systems. In this paper, we study the problem of using meta-learning to deal with uncertainty and heterogeneity in ergodic linear quadratic regulators. We integrate the zeroth-order optimization technique with a typical meta-learning method, proposing an algorithm that omits the estimation of policy Hessian, which applies to tasks of learning a set of heterogeneous but similar linear dynamic systems. The induced meta-objective function inherits important properties of the original cost function when the set of linear dynamic systems are meta-learnable, allowing the algorithm to optimize over a learnable landscape without projection onto the feasible set. We provide a convergence result for the exact gradient descent process by analyzing the boundedness and smoothness of the gradient for the meta-objective, which justify the proposed algorithm with gradient estimation error being small. We also provide a numerical example to corroborate this perspective.


Improved Generalization Bounds for Communication Efficient Federated Learning

arXiv.org Artificial Intelligence

This paper focuses on reducing the communication cost of federated learning by exploring generalization bounds and representation learning. We first characterize a tighter generalization bound for one-round federated learning based on local clients' generalizations and heterogeneity of data distribution (non-iid scenario). We also characterize a generalization bound in R-round federated learning and its relation to the number of local updates (local stochastic gradient descents (SGDs)). Then, based on our generalization bound analysis and our representation learning interpretation of this analysis, we show for the first time that less frequent aggregations, hence more local updates, for the representation extractor (usually corresponds to initial layers) leads to the creation of more generalizable models, particularly for non-iid scenarios. We design a novel Federated Learning with Adaptive Local Steps (FedALS) algorithm based on our generalization bound and representation learning analysis. FedALS employs varying aggregation frequencies for different parts of the model, so reduces the communication cost. The paper is followed with experimental results showing the effectiveness of FedALS.


Matrix Low-Rank Trust Region Policy Optimization

arXiv.org Artificial Intelligence

Most methods in reinforcement learning use a Policy Gradient (PG) approach to learn a parametric stochastic policy that maps states to actions. The standard approach is to implement such a mapping via a neural network (NN) whose parameters are optimized using stochastic gradient descent. However, PG methods are prone to large policy updates that can render learning inefficient. Trust region algorithms, like Trust Region Policy Optimization (TRPO), constrain the policy update step, ensuring monotonic improvements. This paper introduces low-rank matrix-based models as an efficient alternative for estimating the parameters of TRPO algorithms. By gathering the stochastic policy's parameters into a matrix and applying matrix-completion techniques, we promote and enforce low rank. Our numerical studies demonstrate that low-rank matrix-based policy models effectively reduce both computational and sample complexities compared to NN models, while maintaining comparable aggregated rewards.


Almost sure convergence rates of stochastic gradient methods under gradient domination

arXiv.org Artificial Intelligence

First-order methods to minimize an objective function f have played a central role in the success of machine learning. This is accompanied by a growing interest in convergence statements particularly for stochastic gradient methods in different settings. To ensure convergence to the global optimum some kind of convexity assumption on the objective function is required. Especially in machine learning problems the standard (strong) convexity assumption is nearly never fulfilled. However, it is well known that achieving convergence towards global optima is still possible under a weaker assumption, namely under the gradient domination property, often referred to as Polyak-Lojasiewicz (PL)-inequality [45]. Also in reinforcement learning, multiple results have shown that the objective function for policy gradient methods, under specific parametrizations, fulfills a weak type of gradient domination and therefore provably achieve convergence towards the global optimum [11, 21, 33, 34]. Improving the understanding of rates and optimal step size choices for stochastic first order methods is of significant interest for the machine learning and reinforcement learning community.


Boosting Robustness by Clipping Gradients in Distributed Learning

arXiv.org Artificial Intelligence

Robust distributed learning consists in achieving good learning performance despite the presence of misbehaving workers. State-of-the-art (SOTA) robust distributed gradient descent (Robust-DGD) methods, relying on robust aggregation, have been proven to be optimal: Their learning error matches the lower bound established under the standard heterogeneity model of $(G, B)$-gradient dissimilarity. The learning guarantee of SOTA Robust-DGD cannot be further improved when model initialization is done arbitrarily. However, we show that it is possible to circumvent the lower bound, and improve the learning performance, when the workers' gradients at model initialization are assumed to be bounded. We prove this by proposing pre-aggregation clipping of workers' gradients, using a novel scheme called adaptive robust clipping (ARC). Incorporating ARC in Robust-DGD provably improves the learning, under the aforementioned assumption on model initialization. The factor of improvement is prominent when the tolerable fraction of misbehaving workers approaches the breakdown point. ARC induces this improvement by constricting the search space, while preserving the robustness property of the original aggregation scheme at the same time. We validate this theoretical finding through exhaustive experiments on benchmark image classification tasks.


The devil is in discretization discrepancy. Robustifying Differentiable NAS with Single-Stage Searching Protocol

arXiv.org Artificial Intelligence

Neural Architecture Search (NAS) has been widely adopted to design neural networks for various computer vision tasks. One of its most promising subdomains is differentiable NAS (DNAS), where the optimal architecture is found in a differentiable manner. However, gradient-based methods suffer from the discretization error, which can severely damage the process of obtaining the final architecture. In our work, we first study the risk of discretization error and show how it affects an unregularized supernet. Then, we present that penalizing high entropy, a common technique of architecture regularization, can hinder the supernet's performance. Therefore, to robustify the DNAS framework, we introduce a novel single-stage searching protocol, which is not reliant on decoding a continuous architecture. Our results demonstrate that this approach outperforms other DNAS methods by achieving 75.3% in the searching stage on the Cityscapes validation dataset and attains performance 1.1% higher than the optimal network of DCNAS on the non-dense search space comprising short connections. The entire training process takes only 5.5 GPU days due to the weight reuse, and yields a computationally efficient architecture. Additionally, we propose a new dataset split procedure, which substantially improves results and prevents architecture degeneration in DARTS.


LoQT: Low Rank Adapters for Quantized Training

arXiv.org Artificial Intelligence

Training of large neural networks requires significant computational resources. Despite advances using low-rank adapters and quantization, pretraining of models such as LLMs on consumer hardware has not been possible without model sharding, offloading during training, or per-layer gradient updates. To address these limitations, we propose LoQT, a method for efficiently training quantized models. LoQT uses gradient-based tensor factorization to initialize low-rank trainable weight matrices that are periodically merged into quantized full-rank weight matrices. Our approach is suitable for both pretraining and fine-tuning of models, which we demonstrate experimentally for language modeling and downstream task adaptation. We find that LoQT enables efficient training of models up to 7B parameters on a consumer-grade 24GB GPU. We also demonstrate the feasibility of training a 13B parameter model using per-layer gradient updates on the same hardware.


Multiplicative Reweighting for Robust Neural Network Optimization

arXiv.org Artificial Intelligence

Neural networks are widespread due to their powerful performance. However, they degrade in the presence of noisy labels at training time. Inspired by the setting of learning with expert advice, where multiplicative weight (MW) updates were recently shown to be robust to moderate data corruptions in expert advice, we propose to use MW for reweighting examples during neural networks optimization. We theoretically establish the convergence of our method when used with gradient descent and prove its advantages in 1d cases. We then validate our findings empirically for the general case by showing that MW improves the accuracy of neural networks in the presence of label noise on CIFAR-10, CIFAR-100 and Clothing1M. We also show the impact of our approach on adversarial robustness.