A Appendix

A.1 Additional Method Justification

The key idea of QWALE is to lead the agent to nearby states within the distribution of the prior data when it is out of distribution, and to nearby states closer to task completion when it is in distribution. This problem has been studied in stochastic optimal control, particularly REPS [Peters et al., 2010].

In our experiments, we use soft actor-critic [Haarnoja et al., 2018] as our base RL algorithm. The policy and critic networks are MLPs with 2 fully-connected hidden layers of size 256. Following [Sharma et al., 2021b], we use a biased TD update; we apply this update for all our evaluated methods online in order to improve stability. For all experiments using prior data collected through RL, the agent was initialized at test time with the pretrained policy and critic. The details for this environment are in [Sharma et al., 2021b].
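To make the stated architecture concrete, the following is a minimal sketch (not the authors' code) of a policy or critic network with 2 fully-connected hidden layers of size 256, as described above. The ReLU activations, weight initialization, and the example state/action dimensions are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def init_mlp(in_dim, out_dim, hidden=256, n_hidden=2, seed=0):
    """Initialize an MLP with `n_hidden` fully-connected hidden layers of size `hidden`."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden] * n_hidden + [out_dim]
    # One (weight, bias) pair per layer; He-style scaling is an assumption.
    return [(rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in),
             np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    """Forward pass: ReLU on hidden layers, linear output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

# Hypothetical critic Q(s, a) for a 10-dim state and 3-dim action:
critic = init_mlp(in_dim=10 + 3, out_dim=1)
q_value = mlp_forward(critic, np.zeros(13))
```

In an actor-critic setup such as SAC, the policy network would use the same body but output distribution parameters (e.g. a mean and log-standard-deviation per action dimension) instead of a single scalar.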