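end for
Update critic with $\phi_i \leftarrow \phi_i - \alpha_\phi \nabla_{\phi_i} L^i$
Update actor with $\theta_i \leftarrow \theta_i + \alpha_\theta \nabla_{\theta_i} \left( J^i_{PG} + \lambda_1 \sum_{j=1}^{N} J^{i,j}_{TS} \right)$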

Neural Information Processing Systems 

We trained each agent i with online Q-learning [33] on the Qi(ai,s) table using Boltzmann exploration [18]. The Boltzmann temperature is fixed to 1, and we set the learning rate to 0.05 and the discount factor to 0.99. At initialisation, the target's and the ball's vertical positions are fixed, while their horizontal positions are random. In all of our experiments, we use the Adam optimizer [19] to perform parameter updates. We use a buffer size of 10^6 entries and a batch size of 1024.
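As an illustration of the tabular setup described above, the following is a minimal sketch of one agent's Q-learning update with Boltzmann (softmax) exploration, using the stated temperature (1), learning rate (0.05) and discount factor (0.99). The function names, the dictionary-based table and the string state labels are assumptions made for this example; they are not taken from the authors' code, and the environment itself is not modelled here.

```python
import math
import random
from collections import defaultdict

TEMPERATURE = 1.0     # Boltzmann temperature, as stated in the text
LEARNING_RATE = 0.05  # learning rate, as stated in the text
DISCOUNT = 0.99       # discount factor, as stated in the text


def make_q_table(n_actions):
    """Hypothetical table layout: state -> list of Q-values, initialised to zero."""
    return defaultdict(lambda: [0.0] * n_actions)


def boltzmann_probs(q_values, temperature=TEMPERATURE):
    """Softmax over Q-values; subtracting the max keeps exp() numerically stable."""
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_table, state, n_actions, rng=random):
    """Sample an action index with probability proportional to exp(Q / T)."""
    probs = boltzmann_probs(q_table[state])
    return rng.choices(range(n_actions), weights=probs, k=1)[0]


def q_update(q_table, state, action, reward, next_state, done):
    """One online Q-learning step: move Q(s, a) towards the bootstrapped target."""
    if done:
        target = reward
    else:
        target = reward + DISCOUNT * max(q_table[next_state])
    q_table[state][action] += LEARNING_RATE * (target - q_table[state][action])
```

In a multi-agent run, each agent i would hold its own table and call `select_action` and `q_update` on its own action and reward at every environment step. How the shared state is encoded as a table key is left open in this sketch.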
