A Additional details for experiment presented in Section 3 (Motivation)

We trained each agent i with online Q-learning [33], selecting actions with a Boltzmann (softmax) policy over its Q-values.
The Boltzmann temperature is fixed to 1, the learning rate to 0.05, and the discount factor to 0.99. After each learning episode, we evaluate the current greedy policy on 10 episodes and report the mean return. Curves are averaged over 20 seeds, and the shaded area represents the standard error.

SPREAD (Figure 4a): In this environment, there are 3 agents (small orange circles) and 3 landmarks (larger gray circles). To maximize their return, agents must spread out and cover all landmarks.
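For concreteness, the sketch below illustrates this protocol for a single agent: tabular Q-learning with Boltzmann exploration (temperature 1), learning rate 0.05, discount factor 0.99, and a 10-episode greedy evaluation. It is a minimal sketch assuming a generic discrete environment whose reset()/step() interface and helper names are hypothetical, not the authors' implementation.

```python
# Minimal sketch of the per-agent training/evaluation protocol described above.
# The environment interface (env.reset, env.step) is a hypothetical placeholder.
import numpy as np
from collections import defaultdict

ALPHA, GAMMA, TEMPERATURE = 0.05, 0.99, 1.0  # learning rate, discount factor, Boltzmann temperature


def make_q_table(n_actions):
    """Tabular Q-function: one zero-initialised row of Q-values per discrete state."""
    return defaultdict(lambda: np.zeros(n_actions))


def boltzmann_action(q_row, temperature=TEMPERATURE, rng=np.random):
    """Sample an action from the softmax (Boltzmann) distribution over Q-values."""
    logits = q_row / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(q_row), p=probs)


def train_episode(env, q_table):
    """Run one online Q-learning episode with Boltzmann exploration."""
    state, done = env.reset(), False             # hypothetical env API
    while not done:
        a = boltzmann_action(q_table[state])
        next_state, reward, done = env.step(a)   # hypothetical env API
        target = reward + (0.0 if done else GAMMA * q_table[next_state].max())
        q_table[state][a] += ALPHA * (target - q_table[state][a])
        state = next_state


def evaluate_greedy(env, q_table, n_episodes=10):
    """Return the mean return of the current greedy policy over n_episodes."""
    returns = []
    for _ in range(n_episodes):
        state, done, ep_return = env.reset(), False, 0.0
        while not done:
            state, reward, done = env.step(int(q_table[state].argmax()))
            ep_return += reward
        returns.append(ep_return)
    return float(np.mean(returns))
```

In this sketch, alternating train_episode and evaluate_greedy, and repeating the run over 20 seeds, reproduces the reported curves (mean return with standard-error shading).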