
 Agents




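end for
Update critic with $\phi_i \leftarrow \phi_i - \alpha_\phi \nabla_{\phi_i} L_i$
Update actor $i$ with $\theta_i \leftarrow \theta_i + \alpha_\theta \nabla_{\theta_i} \big( J_i^{PG} + \lambda_1 \sum_{j=1}^{N} J_{i,j}^{TS} \big)$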

Neural Information Processing Systems

We trained each agent $i$ with online Q-learning [33] on the $Q_i(a_i, s)$ table using Boltzmann exploration [18]. The Boltzmann temperature is fixed to 1, and we set the learning rate to 0.05 and the discount factor to 0.99. At initialisation, the target's and ball's vertical positions are fixed, while their horizontal positions are random. In all of our experiments, we use the Adam optimizer [19] to perform parameter updates. We use a buffer size of $10^6$ entries and a batch size of 1024.
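As an illustration of the tabular setup described above, the following is a minimal sketch of one agent's online Q-learning step with Boltzmann (softmax) exploration. It is not the authors' code. The learning rate, discount factor, and temperature are taken from the text; the table layout `q_table[state][action]`, the integer state and action indices, and the function names are assumptions made for the example.

```python
import math
import random

ALPHA = 0.05        # learning rate (from the text)
GAMMA = 0.99        # discount factor (from the text)
TEMPERATURE = 1.0   # Boltzmann temperature (from the text)


def boltzmann_probs(q_values, temperature=TEMPERATURE):
    """Softmax over Q-values; subtracting the max keeps exp() numerically stable."""
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(q_values, rng, temperature=TEMPERATURE):
    """Draw an action index with probability proportional to exp(Q / temperature)."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def q_update(q_table, state, action, reward, next_state, done):
    """One online Q-learning step on a table laid out as q_table[state][action].

    The layout and the `done` flag are assumptions of this sketch.
    """
    bootstrap = 0.0 if done else max(q_table[next_state])
    td_target = reward + GAMMA * bootstrap
    q_table[state][action] += ALPHA * (td_target - q_table[state][action])
    return q_table[state][action]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy table: 3 states, 2 actions, initialised to zero.
    q_table = [[0.0, 0.0] for _ in range(3)]
    action = sample_action(q_table[0], rng)
    q_update(q_table, state=0, action=action, reward=1.0, next_state=1, done=False)
    print(q_table[0])
```

With a temperature of 1, the action probabilities are simply the softmax of the raw Q-values, so exploration narrows as the gaps between Q-values grow. In the multi-agent setting, each agent $i$ would keep its own table and apply the same update independently.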