195f15384c2a79cedf293e4a847ce85c-AuthorFeedback.pdf

Neural Information Processing Systems 

The learning rate α for the baseline was chosen to be the best value from [0.1, 0.2, 0.3, 0.4], while our model hyperparameters (the learning rate α_h for h, and the number of bins n_b for the return version of HCA) were selected informally to be α = 0.3, α_b = 0.4, n_b = 3 for the results in Figure 1, and n_b = 10 elsewhere.

The key is that we are learning h, π and V at the same time, but their learning dynamics are different. In particular, h moves quicker than π (regardless of learning rate) as it is updated towards 1 for any observed sample. Now consider some interim V(y) < 0. Note that the return version doesn't suffer from this.

That's a typo and should say *lower* (or equal) entropy.
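The remark that h is updated towards 1 for any observed sample can be illustrated with a minimal sketch. It assumes a tabular h stored as a probability vector over actions for one fixed (x, y) pair, and a simple moving-average update towards the one-hot vector of the observed action. The function name `update_h`, the four-action setup, and the step size are hypothetical choices for illustration, not the implementation used for the reported results.

```python
def update_h(h, action, alpha_h):
    """Move the probability vector h towards the one-hot vector of `action`.

    Assumed update rule (illustrative only): h <- h + alpha_h * (onehot - h).
    The observed action's entry moves towards 1, all others towards 0.
    """
    return [
        p + alpha_h * ((1.0 if a == action else 0.0) - p)
        for a, p in enumerate(h)
    ]


# Hypothetical setup: four actions, uniform initial estimate.
h = [0.25, 0.25, 0.25, 0.25]

# Observe the same action five times with an illustrative step size of 0.4.
for _ in range(5):
    h = update_h(h, action=2, alpha_h=0.4)
```

Under this assumed rule the observed action's entry increases on every sample, whatever the return was, which is one way to read the statement that h moves quicker than π regardless of learning rate. The update also keeps h normalised, since each step is a convex combination of two probability vectors.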
