VIPeR: Provably Efficient Algorithm for Offline RL with Neural Function Approximation

Nguyen-Tang, Thanh, Arora, Raman

arXiv.org Artificial Intelligence 

In this section, we empirically evaluate the proposed algorithm, VIPeR, against several state-of-the-art baselines: (a) PEVI (Jin et al., 2021), which explicitly constructs a lower confidence bound (LCB) for pessimism in a linear model (for convenience, we refer to this algorithm as LinLCB in our experiments); (b) NeuraLCB (Nguyen-Tang et al., 2022a), which explicitly constructs an LCB using neural network gradients; (c) NeuraLCB (Diag), which is NeuraLCB with a diagonal approximation for estimating the confidence set, as suggested in Nguyen-Tang et al. (2022a); (d) Lin-VIPeR, which is VIPeR instantiated with linear function approximation instead of neural network function approximation; and (e) NeuralGreedy (respectively, LinGreedy), which uses neural networks (respectively, linear models) to fit the offline data and acts greedily with respect to the estimated state-action value functions without any pessimism. When the parametric class F in Algorithm 1 is the class of neural networks, we refer to VIPeR as Neural-VIPeR. We do not use data splitting in the experiments. Further algorithmic details of the baselines are provided in Section H. We evaluate all algorithms in two problem settings: (1) the underlying MDP is a linear MDP whose reward functions and transition kernels are linear in a known feature map (Jin et al., 2020), and (2) the underlying MDP is non-linear with horizon length H = 1 (i.e., a non-linear contextual bandit) (Zhou et al., 2020), where the reward function is either synthetic or constructed from MNIST
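To illustrate the difference between the greedy and LCB-style linear baselines, the following is a minimal sketch for the H = 1 (contextual bandit) case. It is not the authors' implementation: the function names, the ridge parameter `lam`, the bonus scale `beta`, and the toy dataset are all assumptions made for illustration. The sketch fits a ridge regression to the offline data, then selects an action either greedily on the estimated reward or by maximizing the estimate minus an uncertainty bonus of the form beta * sqrt(phi^T Lambda^{-1} phi).

```python
import numpy as np

def fit_ridge(features, rewards, lam=1.0):
    """Ridge regression on offline data; returns the estimate and the inverse covariance."""
    d = features.shape[1]
    cov = lam * np.eye(d) + features.T @ features   # regularized empirical covariance
    cov_inv = np.linalg.inv(cov)
    theta_hat = cov_inv @ features.T @ rewards
    return theta_hat, cov_inv

def greedy_action(action_features, theta_hat):
    """LinGreedy-style choice: maximize the estimated reward, no pessimism."""
    return int(np.argmax(action_features @ theta_hat))

def lcb_action(action_features, theta_hat, cov_inv, beta=1.0):
    """LinLCB-style choice: maximize estimated reward minus an uncertainty bonus."""
    mean = action_features @ theta_hat
    # Per-action quadratic form phi^T cov_inv phi, computed row by row.
    quad = np.einsum("ad,dk,ak->a", action_features, cov_inv, action_features)
    return int(np.argmax(mean - beta * np.sqrt(quad)))

# Hypothetical toy dataset with one-hot action features:
# action 0 is observed 100 times (reward 0.5), action 1 only once (reward 2.0).
offline_features = np.vstack([np.tile([1.0, 0.0], (100, 1)), [[0.0, 1.0]]])
offline_rewards = np.concatenate([np.full(100, 0.5), [2.0]])
candidate_actions = np.eye(2)

theta_hat, cov_inv = fit_ridge(offline_features, offline_rewards, lam=1.0)
greedy_choice = greedy_action(candidate_actions, theta_hat)
lcb_choice = lcb_action(candidate_actions, theta_hat, cov_inv, beta=1.0)
print("greedy:", greedy_choice, "LCB:", lcb_choice)
```

In this toy dataset the greedy rule picks the action that was observed only once because its single reward happened to be high, whereas the LCB rule penalizes that action's large uncertainty and prefers the well-covered one. Which choice is actually better depends on the true reward function, which this sketch does not model.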
