678004486c119599ed7d199f47da043a-Supplemental.pdf

Neural Information Processing Systems 

In this section, we introduce some additional numerical experiments.

Figure 2: 2-d gridworld

To add some randomness to the environment, we make the state transitions stochastic. After the environment receives the action signal, the next state may instead be generated by following any of the other three actions, each with probability 0.1. The optimal policy encourages the agent to take the special jump and reach the terminal state. Under the target policy, the agent reaches the terminal state as soon as possible but avoids taking the special jump.
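The random transition rule described above can be illustrated with a minimal sketch. It assumes four actions in total (implied by "the other three actions") and reads the text as saying the chosen action is executed with the remaining probability 1 - 3 × 0.1 = 0.7. The action names, the function names, and the constant `SLIP_PROB` are hypothetical and do not come from the paper. The grid layout and the special jump are not specified in this section, so they are not modeled.

```python
import random

# Assumed action names; the text only implies that there are four actions.
ACTIONS = ("up", "down", "left", "right")

# Probability of executing each of the other three actions (from the text).
SLIP_PROB = 0.1


def executed_action_distribution(intended):
    """Return {action: probability} for the action the environment executes."""
    if intended not in ACTIONS:
        raise ValueError(f"unknown action: {intended!r}")
    # Each of the three non-chosen actions is executed with probability 0.1.
    dist = {a: SLIP_PROB for a in ACTIONS if a != intended}
    # The chosen action keeps the remaining probability mass (about 0.7).
    dist[intended] = 1.0 - SLIP_PROB * (len(ACTIONS) - 1)
    return dist


def sample_executed_action(intended, rng=random):
    """Sample the action the environment executes for a chosen action."""
    dist = executed_action_distribution(intended)
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

A next state would then be obtained by applying the sampled action, rather than the chosen one, to the current state under the gridworld's own movement rules.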
