A Additional details for experiment presented in Section 3 (Motivation)

We trained each agent i with online Q-learning [33], selecting actions with a Boltzmann (softmax) policy over its Q-values.
The Boltzmann temperature is fixed to 1, the learning rate to 0.05, and the discount factor to 0.99. After each learning episode, we evaluate the current greedy policy on 10 episodes and report the mean return. Curves are averaged over 20 seeds, and the shaded area represents the standard error.

SPREAD (Figure 4a): In this environment, there are 3 agents (small orange circles) and 3 landmarks (larger gray circles). To maximize their return, agents must spread out and cover all landmarks.
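For concreteness, the sketch below illustrates this protocol for a single agent: tabular Q-learning with Boltzmann exploration (temperature 1), learning rate 0.05, discount factor 0.99, and a 10-episode greedy evaluation. It is a minimal sketch assuming a generic discrete environment whose reset()/step() interface and helper names are hypothetical, not the authors' implementation.

```python
# Minimal sketch of the per-agent training/evaluation protocol described above.
# The environment interface (env.reset, env.step) is a hypothetical placeholder.
import numpy as np
from collections import defaultdict

ALPHA, GAMMA, TEMPERATURE = 0.05, 0.99, 1.0  # learning rate, discount factor, Boltzmann temperature


def make_q_table(n_actions):
    """Tabular Q-function: one zero-initialised row of Q-values per discrete state."""
    return defaultdict(lambda: np.zeros(n_actions))


def boltzmann_action(q_row, temperature=TEMPERATURE, rng=np.random):
    """Sample an action from the softmax (Boltzmann) distribution over Q-values."""
    logits = q_row / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(q_row), p=probs)


def train_episode(env, q_table):
    """Run one online Q-learning episode with Boltzmann exploration."""
    state, done = env.reset(), False             # hypothetical env API
    while not done:
        a = boltzmann_action(q_table[state])
        next_state, reward, done = env.step(a)   # hypothetical env API
        target = reward + (0.0 if done else GAMMA * q_table[next_state].max())
        q_table[state][a] += ALPHA * (target - q_table[state][a])
        state = next_state


def evaluate_greedy(env, q_table, n_episodes=10):
    """Return the mean return of the current greedy policy over n_episodes."""
    returns = []
    for _ in range(n_episodes):
        state, done, ep_return = env.reset(), False, 0.0
        while not done:
            state, reward, done = env.step(int(q_table[state].argmax()))
            ep_return += reward
        returns.append(ep_return)
    return float(np.mean(returns))
```

In this sketch, alternating train_episode and evaluate_greedy, and repeating the run over 20 seeds, reproduces the reported curves (mean return with standard-error shading).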