Agents
A Ablations
We find that past play greatly stabilizes the emergence of reciprocity in IPD. In cells containing another agent, we include the RUSP observations in these channels. In Figure 11 we show results when training with RUSP in these environments. Consistent with past work, the greedy baseline fails to reach a solution with high collective return. We use a distributed computing infrastructure used in Berner et al.
8caa38721906c1a0bb95c80fab33a893-Supplemental.pdf
V100 GPUs to train the models. Consortium and are licensed under a Creative Commons Attribution 4.0 License. Similarly, for evaluating the agent listener with a human speaker, each agent evaluates 400 human utterances in Fig 5b. In Fig 10, we present the results of the human evaluation on the text game. Sec 4.3, we show that agents trained using our method beat all prior baselines when paired with both The blue bars show the standard deviation across all agents present in the buffer.