A Appendix: RADIAL-DQN
Neural Information Processing Systems
This is not desirable, as a will always overlap with itself, and this is not something we wish to regularize. We initially used c = 0.5 for simplicity and did not tune it, as it worked well in our experiments. To understand the sensitivity of our algorithm to this choice of c, we conducted additional experiments on the two games RoadRunner and BankHeist. We can see that c = 0.75 produces poor standard performance; we believe this is because the requirement is too strict, causing the policy to collapse to a simpler policy with lower reward. The standard DQN was trained for 6M steps, followed by 4.5M steps of RADIAL training.
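The two-phase schedule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the constant names and the phase-selection helper are hypothetical, with the step counts taken directly from the text (6M standard DQN steps, then 4.5M RADIAL steps).

```python
# Hypothetical sketch of the two-phase training schedule described above.
# Step counts are from the text; names are illustrative only.
STANDARD_STEPS = 6_000_000   # standard DQN training
RADIAL_STEPS = 4_500_000     # subsequent RADIAL training

def training_phase(step: int) -> str:
    """Return which loss would be active at a given global step."""
    if step < STANDARD_STEPS:
        return "standard"
    if step < STANDARD_STEPS + RADIAL_STEPS:
        return "radial"
    return "done"
```

A training loop would then switch from the standard DQN loss to the robust RADIAL loss once `training_phase` returns `"radial"`.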