We thank the reviewers for their close reading of the paper and helpful feedback. For example, one can use the density ratio estimates provided by DualDICE to modify (importance-weight) the off-policy data distribution before passing it to a policy gradient or Q-learning method. The figures are overall too small. In Figure 2 the x-axis label is missing; the x-axis is training step.
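The reweighting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `weighted_q_update` is a hypothetical helper, and the `ratios` argument stands in for per-transition density-ratio estimates such as those DualDICE would provide.

```python
import numpy as np

def weighted_q_update(q, transitions, ratios, gamma=0.99, lr=0.1):
    """One pass of importance-weighted tabular Q-learning.

    ratios[i] is a density-ratio estimate w(s, a) for transition i
    (e.g. from DualDICE), used to reweight the off-policy data
    toward the target policy's stationary distribution.
    """
    for (s, a, r, s_next), w in zip(transitions, ratios):
        td_target = r + gamma * np.max(q[s_next])
        # The correction ratio scales the TD update, not the target.
        q[s, a] += lr * w * (td_target - q[s, a])
    return q

# Toy usage: one transition with ratio 2.0 doubles the effective step.
q = weighted_q_update(np.zeros((2, 2)), [(0, 0, 1.0, 1)], [2.0])
```

The same per-transition weights could multiply the loss terms of a policy gradient method instead; only the placement of `w` in the objective changes.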
We address the reviewers' questions about the assumptions, related work, and evaluation.
We thank all reviewers for their valuable time and feedback. Note that multiple recent works (offline and online) simply assume a linear MDP with known features in their analysis. We can use a KL-divergence formulation to impose different distribution priors when they are available. We agree about Section 3.3 and in retrospect should have saved the space; we will remove it and move some of the Appendix into the paper. We will add references to maximum-entropy approaches in RL and IRL.