Appendix

This is only for ease of visualization. In the generative model setting, Agarwal et al. [2020] show that the model-based approach is still minimax optimal, with sample complexity $\widetilde{O}\!\left(\frac{SA}{(1-\gamma)^{3}\varepsilon^{2}}\right)$, by using an $s$-absorbing MDP construction; this model-based technique is later reused in other, more general settings (e.g., linear MDPs). It requires a high-probability guarantee for learning an optimal policy for any reward function, which is strictly stronger than the standard learning task, where one only needs to learn an optimal policy for a fixed reward.

B.2 General absorbing MDP

The general absorbing MDP is defined as follows: for a fixed state $s$ and a sequence $\{u_t\}_{t=1}^{H}$, the MDP $M_{s,\{u_t\}_{t=1}^{H}}$ is identical to $M$ on all states except $s$, and state $s$ is absorbing in the sense that $P_{M_{s,\{u_t\}_{t=1}^{H}}}(s \mid s,a) = 1$ for all $a$, with instantaneous reward $r_t(s,a) = u_t$ at time $t$ for all $a \in \mathcal{A}$. We also use the shorthand notation $V^{\pi}_{s,\{u_t\}}$ for $V^{\pi}_{s,\,M_{s,\{u_t\}_{t=1}^{H}}}$.

We focus on the first claim. Later we shall remove the conditioning on $N$ (see Section B.7). We use the singleton-absorbing MDP $M_{s,\{u^{\star}_t\}_{t=1}^{H}}$ to handle the case (recall $u^{\star}_t$ …
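As a concrete illustration of the construction in B.2, here is a minimal tabular sketch (not from the paper; the function name `make_absorbing_mdp` and the array layout are our assumptions): it takes a finite-horizon MDP $(P, r)$, a state $s$, and a reward sequence $\{u_t\}_{t=1}^{H}$, and returns the absorbing MDP $M_{s,\{u_t\}_{t=1}^{H}}$.

```python
import numpy as np

def make_absorbing_mdp(P, r, s, u):
    """Build the absorbing MDP M_{s,{u_t}} from a tabular finite-horizon MDP.

    P: transition tensor of shape (S, A, S), P[x, a, y] = P(y | x, a)
    r: reward tensor of shape (H, S, A), r[t, x, a] = r_t(x, a)
    s: the state to make absorbing
    u: sequence (u_1, ..., u_H) of instantaneous rewards collected at s
    """
    P_abs = P.copy()
    r_abs = r.copy()
    # State s becomes absorbing: P(s | s, a) = 1 for every action a.
    P_abs[s, :, :] = 0.0
    P_abs[s, :, s] = 1.0
    # The instantaneous reward at s and time t is u_t, for every action a.
    for t in range(r.shape[0]):
        r_abs[t, s, :] = u[t]
    return P_abs, r_abs
```

The MDP is unchanged everywhere except at $s$, matching the definition above.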
Supplementary Materials for "Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity"

A Proofs of the Main Results
We first introduce some additional notation for convenience. Our proof mainly consists of the following steps:

1. Helper lemmas and a crude bound. See A.2, and more precisely, Lemmas A.9 and A.10.
3. Final bound for the $\varepsilon$-approximate NE value. See A.3.
4. Final bounds for the $\varepsilon$-NE policy. See A.5.

A.1 Important Lemmas

We start with the component-wise error bounds.
Online Correlation Clustering: Simultaneously Optimizing All $\ell_p$-norms
Davies, Sami, Moseley, Benjamin, Newman, Heather
The $\ell_p$-norm objectives for correlation clustering present a fundamental trade-off between minimizing total disagreements (the $\ell_1$-norm) and ensuring fairness to individual nodes (the $\ell_\infty$-norm). Surprisingly, in the offline setting it is possible to simultaneously approximate all $\ell_p$-norms with a single clustering. Can this powerful guarantee be achieved in an online setting? This paper provides the first affirmative answer. We present a single algorithm for the online-with-a-sample (AOS) model that, given a small constant fraction of the input as a sample, produces one clustering that is simultaneously $O(\log^4 n)$-competitive for all $\ell_p$-norms with high probability, $O(\log n)$-competitive for the $\ell_\infty$-norm with high probability, and $O(1)$-competitive for the $\ell_1$-norm in expectation. This work successfully translates the offline "all-norms" guarantee to the online world. Our setting is motivated by a new hardness result that demonstrates a fundamental separation between these objectives in the standard random-order (RO) online model. Namely, while the $\ell_1$-norm is trivially $O(1)$-approximable in the RO model, we prove that any algorithm in the RO model for the fairness-promoting $\ell_\infty$-norm must have a competitive ratio of at least $\Omega(n^{1/3})$. This highlights the necessity of a different beyond-worst-case model. We complement our algorithm with lower bounds, showing our competitive ratios for the $\ell_1$- and $\ell_\infty$-norms are nearly tight in the AOS model.
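To make the objective concrete, here is a small sketch (our own illustration, not the paper's algorithm; names like `disagreement_vector` are assumptions) of the $\ell_p$-norm cost of a clustering of a signed graph: each node's coordinate counts the incident "+" edges that are cut plus the incident "-" edges that are not.

```python
import numpy as np

def disagreement_vector(n, plus_edges, minus_edges, cluster):
    """Per-node disagreement counts for a clustering of a signed graph.

    plus_edges / minus_edges: lists of (u, v) pairs labelled "+" / "-"
    cluster: cluster[v] is the cluster id of node v
    """
    y = np.zeros(n, dtype=int)
    for u, v in plus_edges:      # a "+" edge disagrees if it is cut
        if cluster[u] != cluster[v]:
            y[u] += 1
            y[v] += 1
    for u, v in minus_edges:     # a "-" edge disagrees if it is uncut
        if cluster[u] == cluster[v]:
            y[u] += 1
            y[v] += 1
    return y

def lp_cost(y, p):
    """l_p-norm of the disagreement vector. p = 1 gives total
    disagreements (each disagreeing edge counted once per endpoint,
    i.e. twice overall); p = inf gives the max per-node disagreement,
    the fairness objective."""
    return np.linalg.norm(y.astype(float), ord=p)
```

With `y = disagreement_vector(...)`, `lp_cost(y, 1)` and `lp_cost(y, float('inf'))` are the two extremes of the trade-off the abstract describes; a single clustering that is simultaneously competitive must do well at every `p` in between.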
Generalizing Behavior via Inverse Reinforcement Learning with Closed-Form Reward Centroids
Lazzati, Filippo, Metelli, Alberto Maria
We study the problem of generalizing an expert agent's behavior, provided through demonstrations, to new environments and/or additional constraints. Inverse Reinforcement Learning (IRL) offers a promising solution by seeking to recover the expert's underlying reward function, which, if used for planning in the new settings, would reproduce the desired behavior. However, IRL is inherently ill-posed: multiple reward functions, forming the so-called feasible set, can explain the same observed behavior. Since these rewards may induce different policies in the new setting, in the absence of additional information, a decision criterion is needed to select which policy to deploy. In this paper, we propose a novel, principled criterion that selects the "average" policy among those induced by the rewards in a certain bounded subset of the feasible set. Remarkably, we show that this policy can be obtained by planning with the reward centroid of that subset, for which we derive a closed-form expression. We then present a provably efficient algorithm for estimating this centroid using an offline dataset of expert demonstrations only. Finally, we conduct numerical simulations that illustrate the relationship between the expert's behavior and the behavior produced by our method.
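As a toy illustration of the idea (the paper derives a closed-form expression for the centroid; the finite averaging, tabular representation, and value iteration below are our simplifying assumptions, not the authors' method), one can average a finite set of reward functions from the chosen subset of the feasible set and plan greedily with that average:

```python
import numpy as np

def plan_with_centroid(rewards, P, gamma=0.9, iters=500):
    """Plan with the centroid (average) of a set of tabular rewards.

    rewards: array of shape (K, S, A) -- K reward functions on (S, A)
    P: transition tensor of shape (S, A, S), P[x, a, y] = P(y | x, a)
    Returns a greedy deterministic policy, one action per state.
    """
    r_bar = rewards.mean(axis=0)        # centroid of the reward set
    S, A = r_bar.shape
    V = np.zeros(S)
    for _ in range(iters):              # standard value iteration
        Q = r_bar + gamma * (P @ V)     # Q has shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)             # policy induced by the centroid
```

The point of the paper's criterion is precisely that planning with the centroid reproduces the "average" policy among those induced by the rewards in the subset, so a single planning pass suffices.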