
Supplementary Policy

Neural Information Processing Systems

Let $\delta_t(s,a) = Q(s,a) - \hat{Q}(s,a)$ and $F_t(s,a) = r_{\mathrm{peer}} + \gamma \max_{b \in \mathcal{A}} Q(s',b) - \hat{Q}(s,a)$. In (A4), we present a robust DQN algorithm with peer sampling, in which the original loss $\ell((s,a), y)$ is also calibrated.
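The two quantities defined above can be sketched for a tabular Q-function as follows. This is a minimal illustration, not the paper's implementation: the discount factor `gamma`, the peer-sampled reward `r_peer`, and the array-based representation of `Q` and `Q_hat` are assumptions for the sake of a runnable example.

```python
import numpy as np

def td_quantities(Q, Q_hat, s, a, r_peer, s_next, gamma=0.99):
    """Compute the two quantities from the snippet for one transition.

    Q      : current Q-table, shape (n_states, n_actions)  [assumed tabular]
    Q_hat  : reference (e.g. target-network) Q-table, same shape
    r_peer : peer-sampled reward for this transition        [assumed given]
    """
    # delta_t(s, a) = Q(s, a) - Q_hat(s, a)
    delta = Q[s, a] - Q_hat[s, a]
    # F_t(s, a) = r_peer + gamma * max_b Q(s', b) - Q_hat(s, a)
    F = r_peer + gamma * np.max(Q[s_next]) - Q_hat[s, a]
    return delta, F
```

With `Q = [[1, 2], [3, 4]]`, `Q_hat` all zeros, transition `(s=0, a=1, r_peer=1.0, s'=1)` and `gamma=0.5`, this returns `delta = 2.0` and `F = 1.0 + 0.5 * 4.0 = 3.0`.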



In this setting, unlike the basic setting, the objective and constraints are not linear. We focus on a single state-action pair (s, a), stage h, and objective m. Similarly, in the constrained setting, the estimated resource consumptions are underestimates of the true resource consumptions.

B.5 Bounding the Bellman error

We now provide an upper bound on the Bellman error, which arises in the RHS of the regret decomposition (Proposition 3.3). When neither failure event occurs (an event of probability at least 1 - 2δ), Proposition 3.3 upper bounds both the reward regret and the consumption regret. In this section, we prove the main guarantee for the convex-concave setting.
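For context, a regret decomposition of the kind referenced above typically bounds regret by a sum of per-step Bellman errors. The following is a generic sketch of such a bound, not the paper's Proposition 3.3; the symbols ($K$ episodes, horizon $H$, optimistic estimates $\hat{Q}^k_h$, Bellman operator $\mathcal{T}_h$, visited pairs $(s^k_h, a^k_h)$) are assumptions.

```latex
\mathrm{Regret}(K)
  \;\le\;
  \sum_{k=1}^{K} \sum_{h=1}^{H}
  \mathbb{E}\Bigl[
    \bigl(\hat{Q}^{k}_{h} - \mathcal{T}_{h}\hat{Q}^{k}_{h+1}\bigr)(s^{k}_{h}, a^{k}_{h})
  \Bigr]
```

Bounding each Bellman-error term on the right-hand side, as the section sets out to do, then yields a bound on the total regret.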