 Reinforcement Learning




Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters

Neural Information Processing Systems

Through theoretical analyses and construction of examples in toy MDPs, we demonstrate that shared pessimistic targets can paradoxically lead to value estimates that are effectively optimistic.



Reviewer 1

Neural Information Processing Systems

We appreciate R1's recognition of the novelty of our contribution to MARL and the potential impact on a ... We address R1's two concerns below. ... "give-reward" actions are direct applications of conventional RL (which have been applied to multi-agent incentivization ... We appreciate R2's positive feedback on our quantitative results and we are glad that our behavioral ... Figure 6b where the agent gives nonzero reward for "fire cleaning beam but miss" after 40k steps, one reason is that the ... Figure 6a), so it may have "forgotten" the difference between successful and unsuccessful usage of the cleaning beam. As demonstrated more clearly in the Escape Room results (e.g. ... We thank R3 for recognizing our contribution to the general class of opponent-shaping algorithms. ... Prisoner's Dilemma is fully observable).