Supplementary Material for BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning

A Proofs of Theorems

Neural Information Processing Systems 

BAIL includes a regularization scheme to prevent over-fitting when generating the upper envelope. We refer to it as an "early stopping scheme" because the key idea is to return to the parameter values that gave the lowest validation error (see Section 7.8 of Goodfellow et al.). Details are provided in Table 1.

Table 1: BAIL hyper-parameters

Parameter | Value
discount rate γ | 0.99
horizon T | 1000
training set size | 0.8 |B|
validation set size | 0.2 |B|
optimizer | Adam [4]
percentage p% | 30% for BAIL, 25% for Progressive BAIL
upper envelope network structure | two hidden layers of 128 units each, ReLU activation
learning rate | 3 × 10

We use five MuJoCo environments, including Humanoid, which is the most challenging of the MuJoCo environments and is not attempted in most other papers on batch DRL. The BCQ paper [2] also uses the same hyper-parameters for all experiments.
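The early-stopping scheme above can be sketched as follows. This is an illustrative sketch only, not BAIL's actual upper-envelope training code: the function names (`train_step`, `val_loss`) and the patience-based stopping rule are assumptions introduced here; the essential idea from the text is simply to remember and restore the parameters that achieved the lowest validation error.

```python
import copy


def train_with_early_stopping(train_step, val_loss, params,
                              max_epochs=100, patience=10):
    """Run training, tracking validation error after each epoch, and
    return the parameter snapshot with the lowest validation error.

    train_step: params -> params   (one optimization epoch; hypothetical)
    val_loss:   params -> float    (error on the held-out validation split)
    """
    best_params = copy.deepcopy(params)
    best_loss = float("inf")
    epochs_since_best = 0

    for _ in range(max_epochs):
        params = train_step(params)
        loss = val_loss(params)
        if loss < best_loss:
            # New best: snapshot the parameters so we can restore them later.
            best_loss = loss
            best_params = copy.deepcopy(params)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break  # no improvement for `patience` epochs: stop

    # "Return to the parameter values which gave the lowest validation error."
    return best_params, best_loss
```

For example, with a toy `train_step` that drives a scalar parameter past its optimum, the loop stops once validation error has stopped improving and returns the best snapshot rather than the final (over-fit) parameters.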
