A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes

Neural Information Processing Systems

The proximal policy optimization (PPO) algorithm stands as one of the most successful methods in the field of reinforcement learning (RL). Despite its success, the theoretical understanding of PPO remains limited. Specifically, it is unclear whether PPO or its optimistic variants can effectively solve linear Markov decision processes (MDPs), which are arguably the simplest models in RL with function approximation. To bridge this gap, we propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback, and establish a $\tilde{\mathcal{O}}(d^{3/4}H^2K^{3/4})$ regret bound for it. Here $d$ is the ambient dimension of linear MDPs, $H$ is the length of each episode, and $K$ is the number of episodes. Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both stochastic linear MDPs and adversarial linear MDPs with full information. Additionally, our algorithm design features a novel multi-batched updating mechanism, and the theoretical analysis utilizes a new covering-number argument for the value and policy classes, which might be of independent interest.
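
For context, the setup behind these quantities can be sketched as follows (standard definitions of linear MDPs and adversarial regret; the paper's exact formalization may differ in details):

\[
\mathbb{P}_h(s' \mid s, a) = \langle \phi(s,a), \mu_h(s') \rangle, \qquad r_h^k(s,a) = \langle \phi(s,a), \theta_h^k \rangle, \qquad \phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d,
\]

where $\phi$ is a known $d$-dimensional feature map and the reward parameters $\theta_h^k$ may be chosen adversarially in each episode $k$. The regret is then measured against the best fixed policy in hindsight,

\[
\mathrm{Regret}(K) = \max_{\pi} \sum_{k=1}^{K} \Big( V_1^{\pi,k}(s_1^k) - V_1^{\pi^k,k}(s_1^k) \Big),
\]

where $V_1^{\pi,k}$ denotes the value of policy $\pi$ under the episode-$k$ rewards and $\pi^k$ is the learner's policy in episode $k$.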


Review for NeurIPS paper: Follow the Perturbed Leader: Optimism and Fast Parallel Algorithms for Smooth Minimax Games

Neural Information Processing Systems

Summary and Contributions: It is known in the literature that optimistic variants of the FTRL algorithm can yield better bounds when the sequence of loss functions is predictable. Such results are relatively rare for FTPL. This paper proposes an optimistic variant of the FTPL algorithm, which in the worst case matches the known optimal bounds but has the potential to achieve better regret for predictable sequences of loss functions. Specifically, the bounds depend on $\|g_t - \nabla_t\|_*$, where $g_t$ is the estimate of the gradient of the next loss function and $\nabla_t$ is the observed gradient. They instantiate this generic result for the worst-case analysis by setting the future estimate $g_t = 0$, recovering the optimal $O(T^{1/2})$ regret.
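
To make the mechanism concrete, here is a generic sketch of the optimistic FTPL update and the typical shape of the resulting adaptive guarantee (this is the standard optimistic-learning template consistent with the review's description, not the paper's exact statement; the scale parameter $\eta$ and random perturbation $\sigma_t$ are introduced here for illustration):

\[
x_t = \operatorname*{arg\,min}_{x \in \mathcal{X}} \Big\langle \sum_{s<t} \nabla_s + g_t + \sigma_t, \; x \Big\rangle, \qquad \mathbb{E}[\mathrm{Regret}_T] = O\Big( \eta \sum_{t=1}^{T} \|\nabla_t - g_t\|_*^2 + \frac{1}{\eta} \Big),
\]

so that accurate predictions ($g_t \approx \nabla_t$) shrink the regret, while the uninformed choice $g_t = 0$ with $\eta \propto T^{-1/2}$ recovers the worst-case $O(T^{1/2})$ rate.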

