A Detailed Proof
A.1 Proof of Theorem 4.1

Neural Information Processing Systems

We can compute the fixed point of the recursion in Equation A.2 and obtain the corresponding estimate; we then compare these two gaps. To utilize Eq. 4 for policy optimization, we follow the analysis in Section 3.2 of Kumar et al. By choosing different regularizers, we obtain a variety of instances within the CQL family; Eq. B.36, called CFCQL(H), is the update rule we use. In the discrete action space, we train a three-layer MLP network with an MLE loss. In the continuous action space, we use the explicit behavior-density estimation method of Wu et al.
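As a rough illustration of the discrete-action case, the behavior density can be estimated by maximum likelihood on the logged actions. The sketch below is not the authors' implementation: a tabular softmax model stands in for the three-layer MLP, and the dataset is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Synthetic offline dataset of (state, action) pairs drawn from an
# unknown behavior policy (purely illustrative).
true_logits = rng.normal(size=(n_states, n_actions))
true_probs = np.exp(true_logits) / np.exp(true_logits).sum(1, keepdims=True)
states = rng.integers(0, n_states, size=2000)
actions = np.array([rng.choice(n_actions, p=true_probs[s]) for s in states])

onehot = np.zeros((len(states), n_actions))
onehot[np.arange(len(states)), actions] = 1.0

# MLE of the behavior density: minimize the negative log-likelihood of the
# logged actions by gradient descent on per-state logits.
logits = np.zeros((n_states, n_actions))
lr = 1.0
for _ in range(300):
    probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
    per_sample_grad = probs[states] - onehot  # d(-log p[a]) / d(logits[s])
    grad = np.zeros_like(logits)
    np.add.at(grad, states, per_sample_grad)
    logits -= lr * grad / len(states)

# Estimated behavior policy, one probability row per state.
p_hat = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
```

Since the model is tabular, the MLE here simply recovers the per-state empirical action frequencies; an MLP would additionally share structure across states.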




Appendices
A Discussion of CQL Variants


We derive several variants of CQL in Section 3.2. Here, we discuss these variants in more detail and describe their specific properties. To start, we define the notion of a "robust expectation" over the Q-function that penalizes the variance of Q-function predictions under the distribution P̂. To recap, Theorem 3.4 shows that the CQL backup operator increases the difference between the expected Q-value at in-distribution actions and at out-of-distribution actions. Function approximation may give rise to erroneous Q-values at OOD actions: "generalization" or the coupling effects of the function approximator may be heavily influenced by the training distribution. This problem persists even with a large number of samples.
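The CQL(H) instance of this family makes the gap above concrete: it penalizes a log-sum-exp of Q-values over all actions relative to the Q-value of the dataset action, which pushes OOD Q-values down relative to in-distribution ones. A minimal sketch for a discrete action space, with illustrative arrays:

```python
import numpy as np

def cql_h_penalty(q, data_actions):
    """CQL(H)-style regularizer for a batch of Q-value rows:
    mean over the batch of logsumexp_a Q(s, a) - Q(s, a_data).
    q: (batch, n_actions) array; data_actions: (batch,) int array."""
    m = q.max(axis=1, keepdims=True)  # stabilize the log-sum-exp
    lse = (m + np.log(np.exp(q - m).sum(axis=1, keepdims=True))).squeeze(1)
    q_data = q[np.arange(len(q)), data_actions]
    return float((lse - q_data).mean())

# Illustrative Q-values for a batch of 4 states with 3 actions each.
rng = np.random.default_rng(1)
q = rng.normal(size=(4, 3))
penalty = cql_h_penalty(q, np.array([0, 1, 2, 0]))
```

Because logsumexp over a row always exceeds any single entry, the penalty is strictly positive, and minimizing it lowers Q-values at actions outside the dataset relative to the logged actions.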


We have reformulated the theoretical statements to position them with respect to previous work, clarified notation, and fixed surface-level inconsistencies in the theoretical statements.


We thank the reviewers for their constructive feedback. Fu et al. (2020) have added results for AWR, BCQ, REM, and AlgaeDICE. R1/[R3, W1]: Policy improvement result, what is CQL doing. Based on R1/R3's requests, we have added a new result; this follows from applying and extending the tools in Achiam et al. 2017. Note the similarity with Thm. 2 in Laroche et al. Since our submission, Nair et al. 2020 and Ghasemipour et al. 2020 have discussed related points; we will add extended discussion of this point in the paper. We now explicitly indicate dimensions of vectors, matrices, and scalars.



A Extended Related Work


We extend our related work section with the following topics suggested by the reviewers.

A.1 Discussion on Ensembles and Distributional RL
In our main text, we discuss the estimated values for extremely o.o.d. state-action pairs. On the one hand, it is clear that such an assumption holds in tabular settings, where un-visited state-action pairs are never updated. On the other hand, we acknowledge it as a mild assumption that there always exist o.o.d. state-action pairs. The key insight we want to emphasize in Section 4.3.1 is that the estimates for frequently visited state-action pairs are reliable.

Network Structure
Our implementations of TD3, BCQ, and CQL are based on code released by the authors, without changing hyper-parameters. Our code is provided in the supplementary materials and will be made publicly available.
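One way to see why ensembles are relevant to extremely o.o.d. inputs is that independently trained members tend to agree on the data support but diverge far off it. The toy sketch below is not from the paper: a bootstrap ensemble of polynomial regressors stands in for Q-networks, on entirely synthetic 1-D data.

```python
import numpy as np

rng = np.random.default_rng(2)

# In-distribution data: noisy targets on [0, 1] (entirely synthetic).
x = rng.uniform(0.0, 1.0, size=200)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=200)

# Bootstrap ensemble of degree-3 polynomial regressors standing in for
# Q-networks: each member is fit on a resampled copy of the dataset.
coefs = []
for _ in range(20):
    idx = rng.integers(0, len(x), size=len(x))
    coefs.append(np.polyfit(x[idx], y[idx], deg=3))

def ensemble_std(query):
    """Member disagreement (std of predictions) at the query points."""
    preds = np.array([np.polyval(c, query) for c in coefs])
    return preds.std(axis=0)

std_in = ensemble_std(np.array([0.5]))[0]   # inside the data support
std_ood = ensemble_std(np.array([3.0]))[0]  # far outside the support
```

The disagreement at the o.o.d. query dwarfs the in-distribution one, which is the signal ensemble-based offline RL methods use to penalize or avoid such actions.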