 Reinforcement Learning


Entropic Desired Dynamics for Intrinsic Control: Supplemental Material Steven Hansen

Neural Information Processing Systems

While this is not close to the state-of-the-art in general (cf. ... Figure 2 shows the effect of action entropy on exploratory behavior in Montezuma's Revenge. Number of unique avatar positions visited. Full training curves across all 6 Atari games are shown in Figure 1, including the random policy baseline. To ensure this didn't hamper performance, we ... At each state visited by the agent evaluator during training, the agent's state (consisting of the avatar's ... The full curves are included for completeness. The compute cluster we performed experiments on is heterogeneous, and has features such as host-sharing, adaptive load-balancing, etc.




Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

Neural Information Processing Systems

In this paradigm of centralised training for decentralised execution, QMIX [25] is a popular Q-learning algorithm with state-of-the-art performance on the StarCraft Multi-Agent Challenge [26]. QMIX represents the optimal joint action value function using a monotonic mixing function of per-agent utilities.
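
As a rough illustration of what a monotonic mixing function of per-agent utilities can look like, here is a minimal NumPy sketch. It is not the QMIX implementation: the fixed random weights, the ReLU nonlinearity, and the name `monotonic_mix` are assumptions made for illustration only. The one property it demonstrates is that non-negative mixing weights make the joint value non-decreasing in every agent's utility.

```python
import numpy as np


def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Combine per-agent utilities into one joint value (hypothetical helper).

    Taking absolute values keeps every mixing weight non-negative, so the
    output can never decrease when any single agent's utility increases.
    """
    # Hidden layer: non-negative weights followed by ReLU (monotone non-decreasing).
    hidden = np.maximum(agent_qs @ np.abs(w1) + b1, 0.0)
    # Output layer: non-negative weights again preserve monotonicity.
    return float(hidden @ np.abs(w2) + b2)


# Illustrative fixed weights (assumption: a real mixer would learn these).
rng = np.random.default_rng(0)
n_agents, hidden_dim = 3, 8
w1 = rng.normal(size=(n_agents, hidden_dim))
b1 = rng.normal(size=hidden_dim)
w2 = rng.normal(size=hidden_dim)
b2 = 0.0

agent_qs = np.array([0.5, -1.0, 2.0])
q_tot = monotonic_mix(agent_qs, w1, b1, w2, b2)

# Raising one agent's utility must not lower the joint value.
bumped = agent_qs.copy()
bumped[1] += 1.0
q_tot_bumped = monotonic_mix(bumped, w1, b1, w2, b2)
```

Because the joint value is monotone in each utility, each agent greedily maximising its own utility also maximises the mixed value, which is what makes decentralised execution possible in this family of methods.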



An Efficient Asynchronous Method for Integrating Evolutionary and Gradient-based Policy Search

Neural Information Processing Systems

These have the opposite properties, with DRL having good sample efficiency and poor stability, while ES is the reverse. Recently, there have been attempts to combine these algorithms, but these methods rely fully on a synchronous update scheme, which is not ideal for maximizing the benefits of the parallelism in ES.




Appendix: Continuous Doubly Constrained Batch Reinforcement Learning

Neural Information Processing Systems

However, numbers for BCQ and SAC are from our runs for all tasks. These plots show that, in the vast majority of environments, CDC exhibits consistently better performance across different seeds/iterations.


Continuous Doubly Constrained Batch Reinforcement Learning

Neural Information Processing Systems

The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data.