Review for NeurIPS paper: R-learning in actor-critic model offers a biologically relevant mechanism for sequential decision-making

Neural Information Processing Systems 

Weaknesses: More attention should be paid to teasing out the differences between V-learning and R-learning; intermittent initial rewards are essentially the only example. Although it is impressive that new VTA recording data are presented in the paper, I do not find the result particularly informative: it only shows that VTA activity does not contradict the R-learning model, without providing specific support for it. It should be possible to design tasks/protocols under which the two formalisations produce substantially different TD errors, which could help tease apart the biological correlates of the two models. Furthermore, it would be helpful to see more details of the parameter estimation procedure and the resulting best-fitting parameter values; if done properly, this may yield not only a qualitative but also a better quantitative fit between Figure 1E and Figure 1D (as well as between Figure 1D and Figure 1B). Since the models have multiple parameters that substantially affect performance, the two models should be compared under their best-fitting parameters, using formal model-comparison measures such as AIC rather than qualitative fits alone. Model universality across parameter settings is of course helpful, but quantitative fit is equally important.
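To make the requested contrast concrete, here is a minimal sketch (not the authors' code; the ring task, function names, and all parameter values are my own assumptions) of tabular TD(0) under the two formalisations: discounted V-learning, where delta = r + gamma*V(s') - V(s), versus average-reward R-learning, where delta = r - rho + V(s') - V(s) with rho tracking the average reward rate. A dissociating protocol would be one where the two delta sequences differ measurably across recorded trials.

```python
def run_td(rewards, n_states, mode, alpha=0.1, gamma=0.9, beta=0.05):
    """Tabular TD(0) on a deterministic ring of n_states (illustrative only).

    mode "discounted": delta = r + gamma*V[s'] - V[s]         (V-learning)
    mode "average":    delta = r - rho + V[s'] - V[s]          (R-learning),
                       with rho updated toward the average reward per step.
    Returns the per-step TD errors, the learned values, and final rho.
    """
    V = [0.0] * n_states
    rho = 0.0
    deltas = []
    s = 0
    for r in rewards:
        s_next = (s + 1) % n_states  # deterministic cyclic transitions
        if mode == "discounted":
            delta = r + gamma * V[s_next] - V[s]
        else:
            delta = r - rho + V[s_next] - V[s]
            rho += beta * delta  # running estimate of reward rate
        V[s] += alpha * delta
        deltas.append(delta)
        s = s_next
    return deltas, V, rho

# Intermittent schedule: one reward per four-step cycle.
rewards = [1.0 if t % 4 == 3 else 0.0 for t in range(2000)]
deltas_v, V_disc, _ = run_td(rewards, 4, "discounted")
deltas_r, V_diff, rho = run_td(rewards, 4, "average")
```

On this stationary task both TD-error sequences converge to zero and rho approaches the true reward rate (0.25 per step), so it would not dissociate the models; the point is that the authors could search over such protocols (e.g. reward-rate manipulations or non-stationary schedules) for ones where the sequences diverge. For the model comparison itself, the standard AIC = 2k - 2 ln L-hat (k parameters, maximised likelihood L-hat) would suffice.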