Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms
Neural Information Processing Systems
The goal of reinforcement learning (RL) [39] is to maximize the expected total reward by taking actions according to a policy in a stochastic environment, which is modelled as a Markov decision process (MDP) [4]. To obtain an optimal policy, one popular method is the direct maximization of the expected total reward via gradient ascent, which is referred to as the policy gradient (PG) method [40,47].
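As a rough illustration of gradient ascent on the expected reward, the following minimal sketch applies an exact policy gradient to a single-state problem (a multi-armed bandit) with a softmax policy. This is a deliberate simplification of the MDP setting and is not the algorithm analyzed in the paper; the function names, the step size, and the reward values are all assumptions chosen for illustration.

```python
import math


def softmax(theta):
    """Turn action preferences into a probability distribution (the policy)."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * r(a) for a single-state problem."""
    probs = softmax(theta)
    return sum(p * r for p, r in zip(probs, rewards))


def policy_gradient(theta, rewards):
    """Exact gradient of J for a softmax policy: dJ/dtheta_a = pi(a) * (r(a) - J)."""
    probs = softmax(theta)
    j = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j) for p, r in zip(probs, rewards)]


def gradient_ascent(rewards, step_size=0.5, iterations=200):
    """Repeatedly step the policy parameters in the direction of the gradient."""
    theta = [0.0] * len(rewards)  # start from the uniform policy
    for _ in range(iterations):
        grad = policy_gradient(theta, rewards)
        theta = [t + step_size * g for t, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    rewards = [1.0, 0.0]  # hypothetical per-action rewards
    theta = gradient_ascent(rewards)
    print(softmax(theta), expected_reward(theta, rewards))
```

In a full MDP the gradient is not available in closed form and has to be estimated from sampled trajectories, which is where the question of sample complexity arises.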