Reusing Trajectories in Policy Gradients Enables Fast Convergence
Alessandro Montenegro, Federico Mansutti, Marco Mussi, Matteo Papini, Alberto Maria Metelli
arXiv.org Artificial Intelligence
Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly well suited to continuous control problems. These methods learn the parameters of parametric policies via stochastic gradient ascent, typically using on-policy trajectory data to estimate the policy gradient. However, this reliance on fresh data makes them sample-inefficient: vanilla PG methods require $O(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. A common strategy to improve efficiency is to reuse off-policy information from past iterations, such as previous gradients or trajectories. While gradient reuse has received substantial theoretical attention, leading to improved rates of $O(\epsilon^{-3/2})$, the reuse of past trajectories remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that extensive reuse of past off-policy trajectories can significantly accelerate convergence in PG methods. We introduce a power mean correction to the multiple importance weighting estimator and propose RPG (Retrospective Policy Gradient), a PG algorithm that combines old and new trajectories for policy updates. Through a novel analysis, we show that, under established assumptions, RPG achieves a sample complexity of $\widetilde{O}(\epsilon^{-1})$, the best known rate in the literature. We further empirically validate our approach against PG methods with state-of-the-art rates.
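As a rough illustration of the estimator the abstract describes: in multiple importance weighting, a trajectory collected under one of $K$ past behavioral policies is reweighted by the ratio of its density under the current policy to its density under the mixture of the behavioral policies (the balance heuristic). The sketch below, in Python with hypothetical function names and a hypothetical smoothing parameter `lam`, applies a harmonic-style correction, one member of the power-mean family, to bound the weights; the exact correction and constants used by RPG are those defined in the paper, not this sketch.

```python
import numpy as np
from scipy.special import logsumexp

def mis_weights_power_mean(logp_target, logp_behaviors, lam=0.1):
    """Balance-heuristic MIS weights with a harmonic-style correction.

    logp_target:    (N,) log-density of each trajectory under the current policy.
    logp_behaviors: (K, N) log-density of each trajectory under each of the
                    K past policies that generated the data.
    lam:            hypothetical smoothing parameter in (0, 1].
    """
    K = logp_behaviors.shape[0]
    # Log of the mixture density (1/K) * sum_k p_k(tau), per trajectory.
    log_mix = logsumexp(logp_behaviors, axis=0) - np.log(K)
    w = np.exp(logp_target - log_mix)  # balance-heuristic weight
    # Harmonic correction (a power mean of w and 1): keeps weights in
    # [0, 1/lam], trading a small bias for a large variance reduction.
    return w / ((1.0 - lam) + lam * w)

def reused_trajectory_gradient(scores, returns, weights):
    """REINFORCE-style gradient estimate over a pool of reused trajectories.

    scores:  (N, d) sum over steps of grad_theta log pi_theta(a_t | s_t).
    returns: (N,)   trajectory returns.
    weights: (N,)   corrected MIS weights from mis_weights_power_mean.
    """
    return np.mean(weights[:, None] * returns[:, None] * scores, axis=0)
```

Bounding the weights this way is what makes aggressive trajectory reuse safe: uncorrected importance weights can have unbounded variance as the current policy drifts away from the behavioral ones.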
Jun-9-2025