d5ff135377d39f1de7372c95c74dd962-Supplemental.pdf

Neural Information Processing Systems 

If the picked label is correct, the agent gets a reward of r = 0 and the episode ends; if the picked label is incorrect, the agent gets a reward of r = -1 and the episode continues to the next time-step (where it must guess another label for the same image). For the variant labelled "Adaptive", we train a classifier p_θ(y|x), with the same architecture as the DQN agent, on the training dataset of images. Clearly, the policy "always switch" is optimal in M_A and so is ε-optimal under the distribution on MDPs. The proof is a simple modification of the construction in Proposition 5.1. Effectively, this policy visits either the left-most state or the right-most state in the final level.
