678004486c119599ed7d199f47da043a-Supplemental.pdf
Neural Information Processing Systems
In this section, we introduce some additional numerical experiments.

Figure 2: 2-d gridworld

To add some randomness to the environment, we make the state transitions stochastic. After the environment receives the action signal, the next state may instead be generated by following any of the other three actions, each with probability 0.1. The optimal policy encourages the agent to take the special jump and reach the terminal state. Under the target policy, the agent reaches the terminal state as soon as possible but avoids taking the special jump.
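The random transition rule described above can be illustrated with a minimal sketch. It assumes four movement actions with hypothetical names, a hypothetical 5x5 grid, and moves clipped at the walls; none of these details are given in the text. Only the slip probability of 0.1 per alternative action comes from the description, which leaves probability 0.7 for the intended action. The special jump and the terminal state are deliberately left out, since their locations are not specified here.

```python
import random

# Hypothetical action names; the text only implies that there are four actions.
ACTIONS = ("up", "down", "left", "right")
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SLIP_PROB = 0.1  # probability of each of the other three actions (from the text)


def sample_executed_action(intended, rng):
    """Return the action actually executed after the random slip."""
    others = [a for a in ACTIONS if a != intended]
    # The intended action keeps the remaining mass: 1 - 3 * 0.1 = 0.7.
    weights = [1.0 - SLIP_PROB * len(others)] + [SLIP_PROB] * len(others)
    return rng.choices([intended] + others, weights=weights, k=1)[0]


def step(state, intended, rng, size=5):
    """Move one cell on a size x size grid (grid size and clipping are assumptions)."""
    d_row, d_col = MOVES[sample_executed_action(intended, rng)]
    row = min(max(state[0] + d_row, 0), size - 1)
    col = min(max(state[1] + d_col, 0), size - 1)
    return (row, col)
```

Calling `step((2, 2), "up", random.Random(0))` returns one of the four neighbouring cells, with the cell above the start being the most likely outcome.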
Feb-19-2026, 03:53:12 GMT