Reinforcement Learning
Frequentist Regret Analysis of Gaussian Process Thompson Sampling via Fractional Posteriors
Roy, Somjit, Jaiswal, Prateek, Bhattacharya, Anirban, Pati, Debdeep, Mallick, Bani K.
We study Gaussian Process Thompson Sampling (GP-TS) for sequential decision-making over compact, continuous action spaces and provide a frequentist regret analysis based on fractional Gaussian process posteriors, without relying on domain discretization as in prior work. We show that the variance inflation commonly assumed in existing analyses of GP-TS can be interpreted as Thompson Sampling with respect to a fractional posterior with tempering parameter $\alpha \in (0,1)$. We derive a kernel-agnostic regret bound expressed in terms of the information gain parameter $\gamma_t$ and the posterior contraction rate $\epsilon_t$, and identify conditions on the Gaussian process prior under which $\epsilon_t$ can be controlled. As special cases of our general bound, we recover regret of order $\tilde{\mathcal{O}}(T^{\frac{1}{2}})$ for the squared exponential kernel, $\tilde{\mathcal{O}}(T^{\frac{2\nu+3d}{2(2\nu+d)}})$ for the Matérn-$\nu$ kernel, and a bound of order $\tilde{\mathcal{O}}(T^{\frac{2\nu+3d}{2(2\nu+d)}})$ for the rational quadratic kernel. Overall, our analysis provides a unified and discretization-free regret framework for GP-TS that applies broadly across kernel classes.
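One way to see the link between tempering and variance inflation is the following sketch, which assumes a standard Gaussian observation model $y_i = f(x_i) + \eta_i$ with $\eta_i \sim \mathcal{N}(0, \sigma^2)$ and a GP prior with kernel $k$. The symbols $\sigma^2$, $k$, $K_t$, $k_t(x)$ and $y_{1:t}$ are illustrative notation introduced here, not taken from the abstract, and the paper's own formulation may differ. Raising the likelihood to the power $\alpha$ gives

$$
\prod_{i=1}^{t} \mathcal{N}\big(y_i \mid f(x_i), \sigma^2\big)^{\alpha} \;\propto\; \exp\Big(-\frac{\alpha}{2\sigma^2}\sum_{i=1}^{t}\big(y_i - f(x_i)\big)^2\Big) \;\propto\; \prod_{i=1}^{t} \mathcal{N}\big(y_i \mid f(x_i), \sigma^2/\alpha\big),
$$

as a function of $f$, so under these assumptions the fractional posterior is an ordinary GP posterior with the noise variance inflated from $\sigma^2$ to $\sigma^2/\alpha$:

$$
\mu_t^{\alpha}(x) = k_t(x)^{\top}\Big(K_t + \tfrac{\sigma^2}{\alpha} I\Big)^{-1} y_{1:t}, \qquad
k_t^{\alpha}(x, x') = k(x, x') - k_t(x)^{\top}\Big(K_t + \tfrac{\sigma^2}{\alpha} I\Big)^{-1} k_t(x').
$$

Since $\alpha \in (0,1)$ gives $\sigma^2/\alpha > \sigma^2$, the posterior variance $k_t^{\alpha}(x, x)$ is at least as large as in the untempered case $\alpha = 1$, which is the sense in which Thompson samples drawn from the fractional posterior are more dispersed.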
Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning
However, existing offline RL methods tend to behave poorly during fine-tuning. In this paper, we study the fine-tuning problem in the context of conservative offline RL methods, and we devise an approach for learning an effective initialization from offline data that also enables fast online fine-tuning.