We thank the reviewers for their comments. Please find our responses below, with reference indices consistent with the paper. Q3-5: Meaning of the learned divergence? We agree that BC minimizes the policy KL divergence, as noted in Sec. 4 (line 200). This is consistent with the literature, e.g., Sec. 2 in [Yu et al., arXiv:1909.09314].
Quantifying Memory Use in Reinforcement Learning with Temporal Range
Lafuente-Mercado, Rodney, Rus, Daniela, Rusch, T. Konstantin
How much does a trained RL policy actually use its past observations? We propose \emph{Temporal Range}, a model-agnostic metric that treats the first-order sensitivities of a policy's vector outputs over a temporal window, with respect to the input sequence, as a temporal influence profile and summarizes that profile by its magnitude-weighted average lag. Temporal Range is computed via reverse-mode automatic differentiation from the Jacobian blocks $\partial y_s/\partial x_t\in\mathbb{R}^{c\times d}$, averaged over final timesteps $s\in\{t+1,\dots,T\}$, and is well-characterized in the linear setting by a small set of natural axioms. Across diagnostic and control tasks (POPGym; flicker/occlusion; Copy-$k$) and architectures (MLPs, RNNs, SSMs), Temporal Range (i) remains small in fully observed control, (ii) scales with the task's ground-truth lag in Copy-$k$, and (iii) aligns with the minimum history window required for near-optimal return, as confirmed by window ablations. We also report Temporal Range for a compact Long Expressive Memory (LEM) policy trained on the task, using it as a proxy readout of task-level memory. Our axiomatic treatment draws on recent work on range measures, specialized here to temporal lag and extended to vector-valued outputs in the RL setting. Temporal Range thus offers a practical per-sequence readout of memory dependence for comparing agents and environments and for selecting the shortest sufficient context.
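To make the magnitude-weighted average lag concrete, the following is a minimal numpy sketch for a linear state-space model, where the Jacobian blocks $\partial y_s/\partial x_t = C A^{s-t-1} B$ have closed form and no automatic differentiation is needed. The model and the function names (`linear_rnn_jacobian_norms`, `temporal_range`) are illustrative assumptions, not the authors' implementation; in the general nonlinear case the same norms would be obtained via reverse-mode autodiff.

```python
import numpy as np

def linear_rnn_jacobian_norms(A, B, C, T):
    """Frobenius norms of the Jacobian blocks dy_s/dx_t for the linear
    state-space model h_{t+1} = A h_t + B x_t, y_t = C h_t, where the
    block has the closed form dy_s/dx_t = C A^{s-t-1} B (for t < s)."""
    n = A.shape[0]
    norms = np.zeros((T + 1, T + 1))  # norms[s, t], valid for t < s
    for s in range(1, T + 1):
        P = np.eye(n)  # accumulates A^{s-t-1} as t decreases
        for t in range(s - 1, -1, -1):
            norms[s, t] = np.linalg.norm(C @ P @ B)
            P = A @ P
    return norms

def temporal_range(norms):
    """Magnitude-weighted average lag (s - t), averaged over outputs s."""
    T = norms.shape[0] - 1
    per_s = []
    for s in range(1, T + 1):
        w = norms[s, :s]           # sensitivity magnitudes for output y_s
        if w.sum() > 0:
            lags = s - np.arange(s)  # lag s - t for each input x_t
            per_s.append((lags * w).sum() / w.sum())
    return float(np.mean(per_s))
```

Under this sketch, a memoryless model ($A = 0$) yields a Temporal Range of exactly 1 (only the previous input influences each output), while any $A$ with nonzero spectral radius pushes the weighted average lag above 1, matching the intuition that the metric remains small when the policy does not use its past.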
Supplementary Materials

A Organization of Supplementary Materials

The supplementary materials consist of five main sections. Related Literature. In Appendix B, we give a detailed overview of the related literature. Proofs for Section 3. In Appendix C, we give the proofs of Theorem 1 and Proposition 1. Algorithm and Implementation Details. In Appendix D, we provide further details about the implementation and training procedure for PerSim and the RL methods we benchmark against. Experimental Setup. In Appendix E, we detail the setup used to run our experiments.