Supplementary Material Learning to Play Sequential Games versus Unknown Opponents Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, Andreas Krause (NeurIPS 2020)
Our goal is to bound the learner's cumulative regret, where [...] are the actions chosen by the learner and [...].

In case we have k(·, ·) ≤ L for some L > 0, then the result holds for L [...].

The learner selects actions according to the standard MW (multiplicative weights) update algorithm. Following the same proof steps as in the proof of Theorem 1, we can show that, with probability at least 1 − δ, the learner's regret can be bounded as R(T) ≤ [...]. The corollary's statement then follows by observing that [...].

As discussed in Section 3.3, in a repeated Stackelberg game the decision [...]. Before bounding the leader's regret, recall that the algorithm resulting from Corollary 3 consists of [...].

In this section, we describe the experimental setup of Section 4.1. [...] D(y). (18)

Figure 3: Obtained rewards when the rangers know the poachers' model (OPT) and when they use the proposed algorithm to update their patrol strategy online ([...]).

[...] u(x, y) to maximize their own utility function. For the poachers' utility we use [...]. GP-UCB either converges to suboptimal solutions or displays a slower learning curve. In the case of more than one best response, ties are broken in an arbitrary but consistent manner.
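The fragments above refer to the standard multiplicative weights (MW) update. As a point of reference, here is a minimal sketch of one MW (Hedge) step; the function name, loss range, and learning rate are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def mw_update(weights, losses, eta):
    """One multiplicative-weights (Hedge) step.

    weights: current (unnormalized) weights over the action set
    losses:  observed losses per action, assumed to lie in [0, 1]
    eta:     learning-rate parameter
    """
    new_w = weights * np.exp(-eta * losses)
    return new_w / new_w.sum()  # normalize to a probability distribution

# Example: actions with lower loss receive higher probability.
p = mw_update(np.ones(3), np.array([0.0, 1.0, 0.5]), eta=1.0)
```

The standard O(sqrt(T log K)) regret guarantee for MW over K actions is obtained by tuning eta as a function of the horizon T.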
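GP-UCB is mentioned as a baseline. A minimal sketch of the GP-UCB selection rule under a squared-exponential kernel follows; the helper names and hyperparameters (lengthscale, noise, beta) are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def gp_posterior(X_train, y_train, X_query, lengthscale=1.0, noise=1e-3):
    """Posterior mean/std of a zero-mean GP with a squared-exponential kernel."""
    def k(A, B):
        d = A[:, None] - B[None, :]
        return np.exp(-0.5 * (d / lengthscale) ** 2)

    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = k(X_query, X_train)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ y_train
    # diag(Ks Kinv Ks^T) gives the reduction in prior variance (prior var = 1).
    var = 1.0 - np.einsum('ij,jk,ik->i', Ks, Kinv, Ks)
    return mu, np.sqrt(np.maximum(var, 0.0))

def gp_ucb_choice(X_train, y_train, candidates, beta=2.0):
    """Pick the candidate maximizing the upper confidence bound mu + beta * sigma."""
    mu, sigma = gp_posterior(X_train, y_train, candidates)
    return candidates[np.argmax(mu + beta * sigma)]
```

For instance, after a single observation y = 1 at x = 0, the rule prefers a far-away, high-uncertainty candidate, which illustrates the exploration behavior driven by the beta * sigma term.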
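The last fragment states that ties among best responses are broken in an arbitrary but consistent manner. One common way to realize such a rule (an illustrative sketch, not the paper's implementation) is to always return the first maximizer:

```python
import numpy as np

def best_response(utility_row):
    # np.argmax returns the FIRST index attaining the maximum, so repeated
    # calls on actions with equal utility always select the same one:
    # an arbitrary but consistent tie-breaking rule.
    return int(np.argmax(utility_row))
```

Consistency matters here because the learner models the opponent's response; a tie-breaking rule that varied between rounds would make the best-response map ill-defined.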