

 Agents




A Appendix

Neural Information Processing Systems

Notice that the Tabular CRR exp objective looks different from the learning rule defined by Eqn. 4. Following Eqn. 8, we see that whenever µ ... In addition to being safe, we show that each iteration of CRR improves performance. To compute the performance of each agent, as reported in Tables 2, 3, 5, 6 and 7, we adopt the following procedure. We run each agent with three independent seeds. Agent snapshots are made every 50000 learner steps. As discussed in Sec. 3, using K-step returns can hurt the agent's performance ... To test this hypothesis, we evaluate CRR's (using the binary ... This objective is similar to the ones used in [27, 7].






concerns (C

Neural Information Processing Systems

We would like to thank all the reviewers for their constructive feedback. Citations refer to references in the paper and to the additional ones provided below. "I do agree that full information feedback is hard to expect in real scenarios,... However, the current ... Is there an application where this is a more realistic assumption?" The main motivation for our model is a setting that lies between full-information and bandit feedback. The proposed feedback model is also present in other practical applications.



Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The paper proposes a fairer optimization criterion, "regularized maximin", for centralized multi-agent MDPs. The idea, taken from the networking literature, is elegant. The authors also propose an iterative optimization method that scales somewhat better than linear programming. The description of the transition model, lines 69-79, seems unnecessarily detailed.