Foremost, we would like to thank the reviewers and (S)ACs for giving up their time to conduct and organize the

Neural Information Processing Systems 

Results are presented in Fig a. Full details will be provided. Please allow us to first justify the use of the HIL experiment. All of the following points will be clarified in the revised manuscript (V2). 'gridding' continuous state/action spaces in order to apply DP-based methods, citing relevant literature. This is an interesting question. This is why the costs of greedy and RRL differ at the first epoch.