Paper: Generalization of Reinforcement Learners with Working and Episodic Memory
We thank the reviewers for their thoughtful and constructive feedback on our manuscript. Reviewer 3 noted that the Section 2 task descriptions could be better presented; we have reformatted that section, which should help both contextualize each task's difficulty and illustrate what it involves. We also changed our description of IMPALA to match Reviewer 5's suggestion. Regarding the task suite, Reviewer 4 raised a thoughtful question about whether most of the findings would translate to '2D-like' settings. Some 3D tasks in the suite already have '2D-like' semi-counterparts that do not require navigation: '2D-like' because everything is fully observable and the agent has a first-person point of view from a fixed point. The Spot the Difference level was overall harder than Change Detection for our ablation models.
Reviewer:
Lower bound on regret: Assuming you mean Theorem 3 here, the theorem is correct as stated; we use the correct definition in all of our proofs. We mean Lipschitz continuity, as we want close-by models to imply that the solution values are close. The use of this term follows the notation in Bottou et al. It is defined in the formal statement of Theorem 2 (Theorem 5 in the appendix).
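As a hedged illustration of the Lipschitz remark (the symbols $V^*$, $M$, $d$, and $L$ here are ours, not necessarily the paper's notation), the property we mean can be written as
$$|V^*(M_1) - V^*(M_2)| \le L \, d(M_1, M_2),$$
i.e., if two models $M_1, M_2$ are within distance $d(M_1, M_2)$ of each other, their optimal solution values differ by at most a constant multiple $L$ of that distance.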
Reviewer 1: "the statement in line 153: in the neighbourhood of $z$, $\langle J_i(z), f(x)\rangle = 0$."
We are grateful to the reviewers for the insightful comments on our submission. All of the minor comments will also be addressed in the revised manuscript. We will update line 153 accordingly. The domain of $z$ can be easily adjusted by translation and dilation after the training process. Reviewer 1: "emphasize the need for gradient evaluations when you state the observation." We will emphasize this when stating the observation. The first and fourth columns show the relationship between the output and the corresponding quantities; evaluating the NN is very efficient compared to evaluating the FEM model in Case (ii).
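As a minimal sketch of the translation-and-dilation remark (the function name, interval endpoints, and example values below are our own illustration, not taken from the paper), a model trained on a reference domain can be reused on a shifted and scaled domain via an affine change of variables:

```python
def affine_map(z, src_lo, src_hi, dst_lo=0.0, dst_hi=1.0):
    """Map z from [src_lo, src_hi] to [dst_lo, dst_hi] by translation and dilation."""
    scale = (dst_hi - dst_lo) / (src_hi - src_lo)
    return dst_lo + (z - src_lo) * scale

# Hypothetical example: a point in a new domain [-2, 2] is mapped back into
# the reference training domain [0, 1] before evaluating the trained model.
z_ref = affine_map(1.0, -2.0, 2.0)  # -> 0.75
```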
Regarding the reviewer's interpretation, we will first contrast our work with Gelada's.
We thank all reviewers for their time and comments. Here are some general responses, followed by individual ones. Regarding ACE, it would just be an actor-critic analogue of Gelada's Q-learning approach. This has not been done in RL and cannot be handled by ACE. We will include a comparison with TD3 in the next version of the paper, as shown in Figure 1. Somewhat surprisingly, TD3 does not work better than DDPG in our setup.
The execution of SEVIR required several novel ideas and insights, including recognition of a gap in ML-ready weather data.
Thank you to each reviewer for your helpful feedback on our paper. Below we provide our reasoning for several selected points. Due to page limits, only a portion of the updated figure is shown below. A full comparison with further baselines (e.g., TrajGRU) would be out of scope (and well over the page count). The baselines we provide show that, depending on your choice of loss function, certain axes of "goodness" are brought out. We will add more discussion along these lines addressing "what is done and why".