

R-learning in actor-critic model offers a biologically relevant mechanism for sequential decision-making

Neural Information Processing Systems

A few studies have explored sequential stay-or-leave decisions in humans or rodents - the model organism used to access neuronal activity at high resolution. In both cases, decision patterns were collected in foraging tasks - the experimental settings where subjects decide when to leave depleting resources (2).


R-learning in actor-critic model offers a biologically relevant mechanism for sequential decision-making

Neural Information Processing Systems

In real-world settings, we repeatedly decide whether to pursue better conditions or to keep things unchanged. Examples include time investment, employment, entertainment preferences, etc. How do we make such decisions? To address this question, the field of behavioral ecology has developed foraging paradigms - model settings in which human and non-human subjects decide when to leave depleting food resources. Foraging theory, represented by the marginal value theorem (MVT), provides accurate average-case stay-or-leave rules consistent with subjects' behavior toward depleting resources. Yet the algorithms underlying individual choices, and how such algorithms are learned, remain unclear.
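The MVT stay-or-leave rule mentioned above can be sketched numerically. This is a minimal illustration, assuming an exponentially depleting patch; the reward-rate function and all parameter values are illustrative choices, not taken from the paper:

```python
import math

def mvt_leave_time(r0, decay, avg_rate):
    """Time at which the instantaneous patch intake rate r0*exp(-decay*t)
    drops to the environment-wide average reward rate (MVT leave rule)."""
    if avg_rate >= r0:
        return 0.0  # patch never beats the average rate: leave immediately
    return math.log(r0 / avg_rate) / decay

# Example: patch starts at 10 rewards/s, depletes at rate 0.5/s,
# background average rate is 2 rewards/s.
t_leave = mvt_leave_time(r0=10.0, decay=0.5, avg_rate=2.0)
```

Under MVT, richer environments (higher `avg_rate`) yield earlier leaving times, which is the qualitative pattern foraging experiments test.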



We thank the reviewers for their time and thorough comments, as well as their valuation of our work including its

Neural Information Processing Systems

For the larger discussion items, please find the detailed comments below. Additionally, the reviewers highlighted the importance of quantitative fits; we are currently attempting to differentiate between these models using additional manipulations. R-learning may be computationally advantageous. Our work builds upon results in the field, including Ref [2]. This observation enabled us to pursue the hypothesis of a leaky estimate of the average reward.
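A leaky estimate of the average reward, as mentioned above, is typically an exponential moving average. A minimal sketch; the learning rate `beta` and the reward sequence are illustrative, not values from the paper:

```python
def update_avg_reward(rho, r, beta=0.05):
    """Leaky (exponential moving average) estimate of the average reward,
    as used in R-learning-style updates: rho <- rho + beta * (r - rho)."""
    return rho + beta * (r - rho)

# Illustrative reward stream
rho = 0.0
for r in [1.0, 0.0, 1.0, 1.0]:
    rho = update_avg_reward(rho, r)
```

The "leak" means old rewards are down-weighted geometrically, so the estimate tracks recent reward rate rather than the full history average.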


Review for NeurIPS paper: R-learning in actor-critic model offers a biologically relevant mechanism for sequential decision-making

Neural Information Processing Systems

Weaknesses: More attention should be paid to teasing out differences between V- and R-learning, with intermittent initial rewards being essentially the only example. Although it is impressive that new VTA recording data is presented in the paper, I don't feel that the result is particularly helpful - it only shows that VTA activity doesn't contradict the R-learning model, but it does not really provide specific support for it. It should be possible to design different tasks/protocols under which the two formalisations would have substantially different TD errors, which could help tease out biological correlates of the two models. Furthermore, it would be nice to see more details of parameter estimation and the resulting best-fitting parameter values, which, if done properly, may allow achieving not only a qualitative but also a better quantitative fit between Figure 1E and Figure 1D (as well as between Figure 1D and Figure 1B). As the models have multiple parameters substantially affecting performance, the two models should be compared under best-fitting parameters, and the comparison should include formal measures like AIC, not just qualitative fits. Of course, model universality regardless of parameters is helpful, but quantitative fit is equally important.
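The reviewer's point about diverging TD errors can be made concrete: under discounted value learning the TD error includes a discount factor, while under average-reward R-learning the reward is measured relative to a running average rate with no discounting. A minimal sketch (function names and example numbers are illustrative, not from the paper):

```python
def td_error_discounted(r, v_s, v_next, gamma=0.95):
    """TD error under discounted value (V) learning."""
    return r + gamma * v_next - v_s

def td_error_average(r, v_s, v_next, rho):
    """TD error under average-reward R-learning: reward is taken
    relative to the average reward rate rho; no discounting."""
    return r - rho + v_next - v_s

# With identical value estimates, the two errors differ by
# (rho - (1 - gamma) * v_next), which task design can make large.
d_v = td_error_discounted(r=1.0, v_s=2.0, v_next=2.0)
d_r = td_error_average(r=1.0, v_s=2.0, v_next=2.0, rho=0.5)
```

A protocol that manipulates the background reward rate `rho` independently of state values would drive these two error signals apart, which is the kind of dissociation the review asks for.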


Review for NeurIPS paper: R-learning in actor-critic model offers a biologically relevant mechanism for sequential decision-making

Neural Information Processing Systems

This is a well-written and presented paper proposing a new framework for modeling animal behavior during a foraging task, and should be of interest to the NeurIPS audience. After rebuttal, 3 of the reviewers recommended accept based on it providing a nice link between the behavioral economics and reinforcement learning communities, and its strengths in both theory and empirical results. Therefore, I tentatively recommend accept. That said, during the discussions some concerns were brought up regarding some missing related work. I urge the authors to consider discussing in their final version several related works that R4 and I think are quite relevant: Daw et al., 2002, Neural Networks; Schweighofer & Doya, 2003, Neural Networks; Niv et al., 2006/2007 (and related); and also some works from the motivation-modeling literature (that R2 mentions in their review).



The Connection Between R-Learning and Inverse-Variance Weighting for Estimation of Heterogeneous Treatment Effects

Fisher, Aaron

arXiv.org Machine Learning

Our motivation is to shed light on the performance of the widely popular "R-Learner." Like many other methods for estimating conditional average treatment effects (CATEs), R-Learning can be expressed as a weighted pseudo-outcome regression (POR). Previous comparisons of POR techniques have paid careful attention to the choice of pseudo-outcome transformation. However, we argue that the dominant driver of performance is actually the choice of weights. Specifically, we argue that R-Learning implicitly performs an inverse-variance weighted form of POR. These weights stabilize the regression and allow for convenient simplifications of bias terms.
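The weighted-POR view can be sketched for the simplest case of a constant treatment effect, assuming nuisance estimates `m_hat` (outcome regression) and `e_hat` (propensity) are given; all names are illustrative, not the paper's API:

```python
import numpy as np

def r_learner_constant_cate(y, w, m_hat, e_hat):
    """R-Learner for a constant treatment effect, viewed as a weighted
    pseudo-outcome regression: regress (y - m_hat)/(w - e_hat) with
    weights (w - e_hat)**2, the R-Learner's implicit weights."""
    resid_w = w - e_hat               # treatment residual
    pseudo = (y - m_hat) / resid_w    # pseudo-outcome transformation
    weights = resid_w ** 2            # implicit inverse-variance-style weights
    return np.average(pseudo, weights=weights)
```

The weights down-weight observations where `w - e_hat` is near zero, exactly where the pseudo-outcome blows up; that is the stabilization the abstract refers to.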


Deep R-Learning for Continual Area Sweeping

Shah, Rishi, Jiang, Yuqian, Hart, Justin, Stone, Peter

arXiv.org Machine Learning

Coverage path planning is a well-studied problem in robotics in which a robot must plan a path that passes through every point in a given area repeatedly, usually with a uniform frequency. To address the scenario in which some points need to be visited more frequently than others, this problem has been extended to non-uniform coverage planning. This paper considers the variant of non-uniform coverage in which the robot does not know the distribution of relevant events beforehand and must nevertheless learn to maximize the rate of detecting events of interest. This continual area sweeping problem has been previously formalized in a way that makes strong assumptions about the environment, and to date only a greedy approach has been proposed. We generalize the continual area sweeping formulation to include fewer environmental constraints, and propose a novel approach based on reinforcement learning in a Semi-Markov Decision Process. This approach is evaluated in an abstract simulation and in a high-fidelity Gazebo simulation. These evaluations show significant improvement upon the existing approach in general settings, which is especially relevant in the growing area of service robotics.


Efficient Average Reward Reinforcement Learning Using Constant Shifting Values

Yang, Shangdong (Nanjing University) | Gao, Yang (Nanjing University) | An, Bo (Nanyang Technological University) | Wang, Hao (Nanjing University) | Chen, Xingguo (Nanjing University of Posts and Telecommunications)

AAAI Conferences

There are two classes of average reward reinforcement learning (RL) algorithms: model-based ones that explicitly maintain MDP models and model-free ones that do not learn such models. Though model-free algorithms are known to be more efficient, they often cannot converge to optimal policies due to the perturbation of parameters. In this paper, a novel model-free algorithm is proposed, which makes use of constant shifting values (CSVs) estimated from prior knowledge. To encourage exploration during the learning process, the algorithm constantly subtracts the CSV from the rewards. A terminating condition is proposed to handle the unboundedness of Q-values caused by such subtraction. The convergence of the proposed algorithm is proved under very mild assumptions. Furthermore, linear function approximation is investigated to generalize our method to handle large-scale tasks. Extensive experiments on representative MDPs and the popular game Tetris show that the proposed algorithms significantly outperform the state-of-the-art ones.
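The core update can be sketched in tabular form: the constant shifting value takes the place of a learned average-reward estimate and is subtracted from each reward. A minimal illustration; the state/action names, CSV, and numbers are hypothetical, not from the paper:

```python
def csv_q_update(Q, s, a, r, s_next, csv, alpha=0.1):
    """One tabular update with a constant shifting value (CSV):
    Q[s][a] += alpha * (r - csv + max_a' Q[s_next][a'] - Q[s][a])."""
    target = r - csv + max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

# Hypothetical two-state example with actions "stay"/"go"
Q = {"s0": {"stay": 0.0, "go": 0.0}, "s1": {"stay": 0.0, "go": 0.0}}
q = csv_q_update(Q, "s0", "go", r=1.0, s_next="s1", csv=0.3)
```

If the CSV underestimates the optimal average reward, Q-values drift upward without bound, which is why the paper pairs the subtraction with a terminating condition.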