ad7ed5d47b9baceb12045a929e7e2f66-Supplemental.pdf

Neural Information Processing Systems

A.1 Cost for incentivization

We justify the way in which LIO accounts for the cost of incentivization as follows. However, both the reward-giver and recipients require sufficient time to learn the effect of incentives, which means that too large an α would lead to the degenerate result of r_{η^i} = 0. On the other extreme, α = 0 means there is no penalty and may result in profligate incentivization that serves no useful purpose. Let θ^i for i ∈ {1, 2} denote each agent's probability of taking the cooperative action. Each plot has a fixed value for the incentive given for the other action. Each agent observes all agents' positions and can move among the three available states: lever, start, and door.
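
To make the role of α in the cost for incentivization concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the cost is a discounted sum of the L1 norm of the incentives handed out at each step, and that the reward-giver's objective is its discounted extrinsic return minus α times that cost. The function names, the discounting, and the L1 form are all assumptions made for illustration.

```python
from typing import Sequence


def discounted_l1_cost(incentives: Sequence[Sequence[float]], gamma: float) -> float:
    """Discounted sum of the L1 norm of the incentives given at each step.

    incentives[t] holds the rewards given to the other agents at step t.
    The L1 form is an assumption made for this sketch.
    """
    return sum(
        (gamma ** t) * sum(abs(r) for r in step)
        for t, step in enumerate(incentives)
    )


def giver_objective(
    extrinsic_rewards: Sequence[float],
    incentives: Sequence[Sequence[float]],
    alpha: float,
    gamma: float = 0.99,
) -> float:
    """Hypothetical reward-giver objective: return minus alpha times cost."""
    discounted_return = sum(
        (gamma ** t) * r for t, r in enumerate(extrinsic_rewards)
    )
    return discounted_return - alpha * discounted_l1_cost(incentives, gamma)


# Two steps, two recipients; gamma = 1 keeps the arithmetic easy to follow.
rewards = [1.0, 2.0]
given = [[0.5, 0.5], [1.0, 0.0]]
no_penalty = giver_objective(rewards, given, alpha=0.0, gamma=1.0)
with_penalty = giver_objective(rewards, given, alpha=1.0, gamma=1.0)
```

With α = 0 the objective ignores how much incentive was given, so nothing discourages giving more. As α grows, each unit of incentive costs more, and under this sketch a sufficiently large α makes giving nothing the best choice, which mirrors the degenerate result described above.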



Reviewer 1


We appreciate R1's recognition of the novelty of our contribution to MARL and the potential impact on a We address R1's two concerns below. "give-reward" actions are direct applications of conventional RL (which have been applied to multi-agent incentivization We appreciate R2's positive feedback on our quantitative results and we are glad that our behavioral Figure 6b where the agent gives nonzero reward for "fire cleaning beam but miss" after 40k steps, one reason is that the Figure 6a), so it may have "forgotten" the difference between successful and unsuccessful usage of the cleaning beam. As demonstrated more clearly in the Escape Room results (e.g. We thank R3 for recognizing our contribution to the general class of opponent-shaping algorithms. Prisoner's Dilemma is fully observable).