dual formulation
Sample Complexity of Distributionally Robust Off-Dynamics Reinforcement Learning with Online Interaction
He, Yiting, Liu, Zhishuai, Wang, Weixin, Xu, Pan
Off-dynamics reinforcement learning (RL), where training and deployment transition dynamics are different, can be formulated as learning in a robust Markov decision process (RMDP) where uncertainties in transition dynamics are imposed. Existing literature mostly assumes access to generative models allowing arbitrary state-action queries or pre-collected datasets with a good state coverage of the deployment environment, bypassing the challenge of exploration. In this work, we study a more realistic and challenging setting where the agent is limited to online interaction with the training environment. To capture the intrinsic difficulty of exploration in online RMDPs, we introduce the supremal visitation ratio, a novel quantity that measures the mismatch between the training dynamics and the deployment dynamics. We show that if this ratio is unbounded, online learning becomes exponentially hard. We propose the first computationally efficient algorithm that achieves sublinear regret in online RMDPs with $f$-divergence based transition uncertainties. We also establish matching regret lower bounds, demonstrating that our algorithm achieves optimal dependence on both the supremal visitation ratio and the number of interaction episodes. Finally, we validate our theoretical results through comprehensive numerical experiments.
Checklist 1. For all authors (a)
Do the main claims made in the abstract and introduction accurately reflect the paper's If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Y es] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Did you include the total amount of compute and the type of resources used (e.g., type Did you include any new assets either in the supplemental material or as a URL? [N/A] Did you discuss whether and how consent was obtained from people whose data you're If you used crowdsourcing or conducted research with human subjects... (a) The full version of the Table 1 is given in Table 3. That is, the following relationships hold: 2g p uq " sup This formulation can be found in Lemma 3.1 of Jenatton et al. First we compute the gradient g p uq " รฟ A.7 Log Sum First, we compute the derivative g puq " log p? u ` ฮต q รนรฑ g " 0, (46) which gives the inverse mapping? However, it is separable, and in one dimension we have g p uq " null tu ฤ 0u . " au ฤ m uq, (69) where ConvpAq is the convex hull of the set A. Similarly define s S Running for 1000 epochs, for example, gets the fraction of nonzeros down to around 0.1, at a slight expense of accuracy.
e22312179bf43e61576081a2f250f845-AuthorFeedback.pdf
We would like to thank the reviewers for their positive and helpful feedback. Typo: The result of Theorem 1 is on the expected norm of ฮธ indeed, thank you for pointing this out. Derivations can be found in Appendix A. Y et, and although it leads to a related algorithm, our approach is different. We will make sure to insist on the points discussed in this rebuttal in a revised version of the paper. Fastest rates for stochastic mirror descent methods.
Scalable and adaptive prediction bands with kernel sum-of-squares
Allain, Louis, da Veiga, Sรฉbastien, Staber, Brian
Conformal Prediction (CP) is a popular framework for constructing prediction bands with valid coverage in finite samples, while being free of any distributional assumption. A well-known limitation of conformal prediction is the lack of adaptivity, although several works introduced practically efficient alternate procedures. In this work, we build upon recent ideas that rely on recasting the CP problem as a statistical learning problem, directly targeting coverage and adaptivity. This statistical learning problem is based on reproducible kernel Hilbert spaces (RKHS) and kernel sum-of-squares (SoS) methods. First, we extend previous results with a general representer theorem and exhibit the dual formulation of the learning problem. Crucially, such dual formulation can be solved efficiently by accelerated gradient methods with several hundreds or thousands of samples, unlike previous strategies based on off-the-shelf semidefinite programming algorithms. Second, we introduce a new hyperparameter tuning strategy tailored specifically to target adaptivity through bounds on test-conditional coverage. This strategy, based on the Hilbert-Schmidt Independence Criterion (HSIC), is introduced here to tune kernel lengthscales in our framework, but has broader applicability since it could be used in any CP algorithm where the score function is learned. Finally, extensive experiments are conducted to show how our method compares to related work. All figures can be reproduced with the accompanying code.
Nested Stochastic Gradient Descent for (Generalized) Sinkhorn Distance-Regularized Distributionally Robust Optimization
Yang, Yufeng, Zhou, Yi, Lu, Zhaosong
Distributionally robust optimization (DRO) is a powerful technique to train robust models against data distribution shift. This paper aims to solve regularized nonconvex DRO problems, where the uncertainty set is modeled by a so-called generalized Sinkhorn distance and the loss function is nonconvex and possibly unbounded. Such a distance allows to model uncertainty of distributions with different probability supports and divergence functions. For this class of regularized DRO problems, we derive a novel dual formulation taking the form of nested stochastic programming, where the dual variable depends on the data sample. To solve the dual problem, we provide theoretical evidence to design a nested stochastic gradient descent (SGD) algorithm, which leverages stochastic approximation to estimate the nested stochastic gradients. We study the convergence rate of nested SGD and establish polynomial iteration and sample complexities that are independent of the data size and parameter dimension, indicating its potential for solving large-scale DRO problems. We conduct numerical experiments to demonstrate the efficiency and robustness of the proposed algorithm.