A Additional details for experiment presented in Section 3 (Motivation)

We trained each agent i with online Q-learning [33] on its Q-function.

The Boltzmann temperature is fixed to 1, and we set the learning rate to 0.05 and the discount factor to 0.99.

To maximize their return, agents must therefore spread out and cover all landmarks.

We use a discount factor γ of 0.95. Since policies' hidden layers are of size 128, the corresponding value for

During training, a policy is evaluated on a set of 10 different episodes every 100 learning steps. TeamReg is outperformed by all other algorithms when considering average return across agents.
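As an illustration of the tabular setup described at the start of this appendix, the following is a minimal sketch of one agent's online Q-learning update with Boltzmann (softmax) exploration at temperature 1, learning rate 0.05, and discount factor 0.99. The Q-table layout and the env.step interface are assumptions made for the example, not part of the paper.

```python
import numpy as np

def boltzmann_action(q_row, temperature=1.0):
    """Sample an action from the softmax (Boltzmann) distribution over Q-values."""
    logits = q_row / temperature
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(q_row), p=probs)

def q_learning_step(Q, s, env, lr=0.05, gamma=0.99):
    """One online Q-learning transition for a single agent: act, observe, update Q in place."""
    a = boltzmann_action(Q[s], temperature=1.0)
    s_next, r, done = env.step(a)          # hypothetical environment interface
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])     # temporal-difference update
    return s_next, done

# Q would be initialized as np.zeros((n_states, n_actions)) for each agent,
# with n_states and n_actions set by the environment (placeholders here).
```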