convergence theory
Asynchronous Coordinate Descent under More Realistic Assumptions
Asynchronous-parallel algorithms have the potential to vastly speed up computation by eliminating costly synchronization. However, our understanding of these algorithms is limited because the current convergence theory of asynchronous block coordinate descent algorithms is based on somewhat unrealistic assumptions. In particular, the age of the shared optimization variables being used to update blocks is assumed to be independent of the block being updated. Additionally, it is assumed that the updates are applied to randomly chosen blocks. In this paper, we argue that these assumptions either fail to hold or imply less efficient implementations.
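To make the staleness and random-block assumptions concrete, here is a minimal serial simulation of asynchronous block coordinate descent on a small quadratic. The delay model (`max_delay`), block partition, step size, and test problem are illustrative assumptions, not the paper's algorithm: each update computes a block partial gradient at a stale copy of the shared iterate whose age is drawn at random, which is exactly the quantity the paper argues cannot be treated as independent of the chosen block.

```python
import numpy as np

# Minimal serial simulation of asynchronous block coordinate descent on
# f(x) = 0.5 * x^T A x - b^T x.  The delay model, block partition, and step
# size are illustrative assumptions, not taken from the paper.
rng = np.random.default_rng(0)
n, n_blocks, max_delay, step = 8, 4, 3, 0.05
A = np.diag(np.full(n, 2.0)) + 0.1 * np.ones((n, n))   # positive definite
b = np.ones(n)

x = np.zeros(n)
history = [x.copy()]                                    # past iterates, to model staleness
for k in range(2000):
    age = rng.integers(0, min(max_delay, len(history) - 1) + 1)
    x_stale = history[-1 - age]                         # read a possibly outdated copy
    block = rng.integers(n_blocks)                      # the "independent random block" assumption
    idx = slice(block * (n // n_blocks), (block + 1) * (n // n_blocks))
    x[idx] -= step * (A @ x_stale - b)[idx]             # block partial gradient at the stale point
    history.append(x.copy())
print(np.linalg.norm(A @ x - b))                        # residual should be small
```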
CCCP is Frank-Wolfe in disguise
This paper uncovers a simple but rather surprising connection: it shows that the well-known convex-concave procedure (CCCP) and its generalization to constrained problems are both special cases of the Frank-Wolfe (FW) method. This connection not only provides insight of deep pedagogical value (in our opinion), but also immediately transfers the recently discovered convergence theory of nonconvex Frank-Wolfe methods to CCCP, closing a long-standing gap in its non-asymptotic convergence theory. We hope the viewpoint uncovered by this paper spurs the transfer of other advances made for FW to both CCCP and its generalizations.
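As a pointer to what the correspondence looks like, the sketch below records the classical CCCP update for a difference-of-convex objective next to a generic FW step; the epigraph reformulation indicated in the comments is one way to line the two up, offered as a reading of the abstract's claim rather than a reproduction of the paper's derivation (the notation g, h, F, C, and the step size $\gamma_k$ are the usual conventions, not taken from the paper).

```latex
% DC objective: f(x) = g(x) - h(x) with g, h convex and h differentiable.
% CCCP linearizes the concave part -h at the current iterate:
\[
  x_{k+1} \in \operatorname*{arg\,min}_{x}\; g(x) - \langle \nabla h(x_k),\, x \rangle .
\]
% Generic Frank-Wolfe step for minimizing F over a set C:
\[
  s_k \in \operatorname*{arg\,min}_{s \in C} \langle \nabla F(x_k),\, s \rangle ,
  \qquad
  x_{k+1} = x_k + \gamma_k (s_k - x_k).
\]
% One way to line these up: rewrite  min_x g(x) - h(x)  as
%   min_{(x,t) \in \mathrm{epi}\, g}  t - h(x);
% the linear minimization oracle over the epigraph of g, taken with step size
% \gamma_k = 1, then reproduces the CCCP update above.
```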
Stochastic Anderson Mixing for Nonconvex Stochastic Optimization
Anderson mixing (AM) is an acceleration method for fixed-point iterations. Despite its success and wide usage in scientific computing, the convergence theory of AM remains unclear, and its applications to machine learning problems are not well explored. In this paper, by introducing damped projection and adaptive regularization to classical AM, we propose a Stochastic Anderson Mixing (SAM) scheme to solve nonconvex stochastic optimization problems. Under mild assumptions, we establish the convergence theory of SAM, including almost sure convergence to stationary points and the worst-case iteration complexity. Moreover, the complexity bound can be improved when an iterate is chosen at random as the output. To further accelerate convergence, we incorporate a variance reduction technique into the proposed SAM. We also propose a preconditioned mixing strategy for SAM, which can empirically achieve faster convergence or better generalization ability. Finally, we apply the SAM method to train various neural networks, including a vanilla CNN, ResNets, WideResNet, ResNeXt, DenseNet, and an LSTM. Experimental results on image classification and language modeling demonstrate the advantages of our method.
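For readers unfamiliar with the base scheme, below is a minimal sketch of classical Anderson mixing for a fixed-point problem x = g(x), the deterministic method the abstract starts from. The damped projection, adaptive regularization, and stochastic analysis that define SAM are not reproduced here, and the memory size and test problem are arbitrary choices for the sketch.

```python
import numpy as np

def anderson_mixing(g, x0, m=5, max_iter=200, tol=1e-10):
    """Classical Anderson mixing (mixing parameter beta = 1) for x = g(x)."""
    x = np.asarray(x0, dtype=float)
    X, F = [], []                               # histories of iterates and residuals
    for _ in range(max_iter):
        gx = g(x)
        f = gx - x                              # residual of the fixed-point map
        if np.linalg.norm(f) < tol:
            break
        X.append(x.copy()); F.append(f.copy())
        if len(F) > m + 1:                      # keep at most m past differences
            X.pop(0); F.pop(0)
        if len(F) == 1:
            x = gx                              # plain fixed-point step to start
            continue
        dF = np.column_stack([F[i + 1] - F[i] for i in range(len(F) - 1)])
        dG = np.column_stack([(X[i + 1] + F[i + 1]) - (X[i] + F[i])
                              for i in range(len(F) - 1)])  # differences of g(x_i)
        gamma, *_ = np.linalg.lstsq(dF, f, rcond=None)       # least-squares mixing weights
        x = gx - dG @ gamma                     # Anderson-accelerated update
    return x

# Example: accelerate a Richardson-style fixed-point map for a small linear system.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(anderson_mixing(lambda x: x - 0.1 * (A @ x - b), np.zeros(2)))  # ~ solution of A x = b
```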
On the Convergence Theory for Hessian-Free Bilevel Algorithms
Bilevel optimization has arisen as a powerful tool in modern machine learning. However, due to the nested structure of bilevel optimization, even gradient-based methods require second-order derivative approximations via Jacobian- and/or Hessian-vector computations, which can be costly and unscalable in practice. Recently, Hessian-free bilevel schemes have been proposed to resolve this issue, where the general idea is to use zeroth- or first-order methods to approximate the full hypergradient of the bilevel problem. However, we empirically observe that such approximations can lead to large variance and unstable training, whereas estimating only the response Jacobian matrix as a partial component of the hypergradient turns out to be extremely effective. To this end, we propose a new Hessian-free method, which adopts a zeroth-order-like method to approximate the response Jacobian matrix by taking the difference between two optimization paths. Theoretically, we provide a convergence rate analysis for the proposed algorithms, where our key challenge is to characterize the approximation and smoothness properties of the trajectory-dependent estimator, which can be of independent interest. This is the first known convergence rate result for this type of Hessian-free bilevel algorithm. Experimentally, we demonstrate that the proposed algorithms outperform baseline bilevel optimizers on various bilevel problems. In particular, in our experiment on few-shot meta-learning with a ResNet-12 network on the miniImageNet dataset, we show that our algorithm outperforms baseline meta-learning algorithms, while other baseline bilevel optimizers do not solve such meta-learning problems within a comparable time frame.
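A hedged illustration of the "difference of two optimization paths" idea: the inner problem is solved approximately by a fixed number of gradient steps, and a directional derivative of that inner solution map (a response Jacobian-vector product) is estimated by finite-differencing two such inner runs. The quadratic inner objective, the function names `inner_gd` and `response_jvp`, and the step counts below are assumptions made for the sketch, not the paper's algorithm or hyperparameters.

```python
import numpy as np

def inner_gd(x, y0, steps=50, lr=0.1):
    """Approximate y*(x) = argmin_y g(x, y) for the toy choice g(x, y) = 0.5 * ||y - x||^2."""
    y = y0.copy()
    for _ in range(steps):
        y -= lr * (y - x)                       # gradient of g in y
    return y

def response_jvp(x, v, y0, delta=1e-4):
    """Finite-difference estimate of (dy*(x)/dx) @ v from two inner optimization paths."""
    y_plus = inner_gd(x + delta * v, y0)
    y_base = inner_gd(x, y0)
    return (y_plus - y_base) / delta

x = np.array([1.0, -2.0])
v = np.array([1.0, 0.0])
print(response_jvp(x, v, y0=np.zeros(2)))       # approximately v, since y*(x) = x here
```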
On the Convergence Theory of Debiased Model-Agnostic Meta-Reinforcement Learning
We consider Model-Agnostic Meta-Learning (MAML) methods for Reinforcement Learning (RL) problems, where the goal is to find a policy using data from several tasks represented by Markov Decision Processes (MDPs) that can be updated by one step of \textit{stochastic} policy gradient for the realized MDP. In particular, using stochastic gradients in MAML update steps is crucial for RL problems since computation of exact gradients requires access to a large number of possible trajectories. For this formulation, we propose a variant of the MAML method, named Stochastic Gradient Meta-Reinforcement Learning (SG-MRL), and study its convergence properties. We derive the iteration and sample complexity of SG-MRL to find an $\epsilon$-first-order stationary point, which, to the best of our knowledge, provides the first convergence guarantee for model-agnostic meta-reinforcement learning algorithms. We further show how our results extend to the case where more than one step of stochastic policy gradient method is used at test time. Finally, we empirically compare SG-MRL and MAML in several deep RL environments.
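To show only the bilevel shape of the update the abstract describes (adapt with one inner gradient step per task, then take a meta-gradient through that step), here is a toy sketch in which each task's expected return is replaced by a simple quadratic surrogate, so an exact gradient stands in for the stochastic policy gradient. The names `task_centers`, `alpha`, and `beta` are placeholders; none of the RL estimators, trajectory sampling, or the paper's debiasing appear here.

```python
import numpy as np

# Toy MAML-style meta-update: surrogate reward J_i(theta) = -0.5 * ||theta - c_i||^2
# stands in for an MDP's expected return.
rng = np.random.default_rng(1)
task_centers = rng.normal(size=(5, 3))           # c_i for 5 hypothetical tasks
alpha, beta = 0.1, 0.05                          # inner (adaptation) and meta step sizes

theta = np.zeros(3)
for it in range(500):
    meta_grad = np.zeros_like(theta)
    for c in task_centers:
        inner_grad = c - theta                   # gradient of J_i at theta
        adapted = theta + alpha * inner_grad     # one step of adaptation
        # d(adapted)/d(theta) = (1 - alpha) * I for this quadratic, so the chain
        # rule gives the meta-gradient of J_i(adapted) with respect to theta:
        meta_grad += (1 - alpha) * (c - adapted)
    theta += beta * meta_grad / len(task_centers)  # meta gradient-ascent step
```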
Schedulers for Schedule-free: Theoretically inspired hyperparameters
Pun, Yuen-Man, Buchholz, Matthew, Gower, Robert M.
The recently proposed schedule-free method has been shown to achieve strong performance when hyperparameter tuning is limited. The current theory for schedule-free only supports a constant learning rate, whereas the implementation used in practice uses a warm-up schedule. We show how to extend the last-iterate convergence theory of schedule-free to allow for any scheduler, and how the averaging parameter has to be updated as a function of the learning rate. We then perform experiments showing that our convergence theory has some predictive power with regard to practical runs on deep neural networks, even though the theory relies on a convexity assumption. When applied to the warmup-stable-decay (wsd) schedule, our theory yields the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$. We then use convexity to design a new adaptive Polyak learning rate schedule for schedule-free. We prove an optimal anytime last-iterate convergence rate for our new Polyak schedule, and show that it performs well compared to a number of baselines on a black-box model distillation task.
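For orientation, here is a minimal sketch of the schedule-free SGD recursion with an arbitrary learning-rate schedule. The averaging weight below, proportional to the squared learning rate, is one common choice and not necessarily the paper's exact prescription; the schedule, test problem, and `beta` are placeholder values for the sketch.

```python
import numpy as np

def schedule_free_sgd(grad, x0, lrs, beta=0.9):
    # z is the base SGD-like iterate, x the averaged iterate that is returned,
    # and the gradient is evaluated at an interpolation y of the two.
    z, x = x0.copy(), x0.copy()
    sum_sq = 0.0
    for lr in lrs:
        y = (1 - beta) * z + beta * x
        z = z - lr * grad(y)
        sum_sq += lr ** 2
        c = lr ** 2 / sum_sq            # schedule-dependent averaging weight (assumed form)
        x = (1 - c) * x + c * z
    return x

# Example: a convex quadratic with a linear-warmup-then-constant schedule.
A = np.diag([1.0, 10.0])
lrs = np.concatenate([np.linspace(0.005, 0.05, 20), np.full(200, 0.05)])
print(schedule_free_sgd(lambda w: A @ w, np.array([5.0, 5.0]), lrs))
```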
Quantum Kernel Methods: Convergence Theory, Separation Bounds and Applications to Marketing Analytics
Sáez-Ortuño, Laura, Forgas-Coll, Santiago, Ferrara, Massimiliano
Quantum machine learning (QML) has emerged as one of the most promising near-term applications of quantum computing, with quantum kernel methods representing a particularly elegant bridge between classical machine learning theory and quantum computational advantages [17, 8, 18]. The fundamental insight underlying quantum kernels is that quantum circuits can efficiently compute inner products in exponentially large Hilbert spaces, potentially capturing data relationships that are intractable for classical methods [12]. This study presents an end-to-end feasibility test of a quantum-enhanced method for supervised classification on a real consumer dataset, evaluated with ROC analysis. We examine whether shallow, NISQ-compatible quantum embeddings and kernels can improve class separability and enable threshold-based operation along the ROC curve without retraining. We report 0.7790 accuracy, 0.7647 precision, 0.8609 recall, 0.8100 F1, and 0.83 AUC. Recent work has demonstrated empirical success of quantum support vector machines (Q-SVM) in various domains, from particle physics [21] to bioinformatics [19]. However, the theoretical foundations explaining when and why quantum advantage emerges remain incomplete. While pioneering studies by Schuld and Killoran [17] established the mathematical equivalence between variational quantum circuits and kernel methods, and Liu et al. [12] proved unconditional quantum advantage for specifically constructed problems, several fundamental questions remain open: 1. What are the convergence guarantees for variational quantum kernel optimization?
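As a runnable sketch of the kernel-SVM pipeline the excerpt describes: on quantum hardware the Gram matrix entries would be embedding-circuit fidelities $|\langle \phi(x_i)|\phi(x_j)\rangle|^2$ estimated from measurements; here a classical RBF kernel and synthetic data stand in (both are assumptions made for the sketch) so the precomputed-kernel classification and ROC evaluation run as-is.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def gram(X1, X2, gamma=0.5):
    """Classical RBF Gram matrix; a quantum kernel would supply these entries instead."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="precomputed")              # SVM trained on a precomputed Gram matrix
clf.fit(gram(X_tr, X_tr), y_tr)
scores = clf.decision_function(gram(X_te, X_tr))
print("AUC:", roc_auc_score(y_te, scores))   # threshold-free ROC evaluation
```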