Pessimism
Pessimism for Offline Linear Contextual Bandits using $\ell_p$ Confidence Sets
We present a family $\{\widehat{\pi}_p\}_{p\ge 1}$ of pessimistic learning rules for offline learning of linear contextual bandits, relying on confidence sets with respect to different $\ell_p$ norms, where $\widehat{\pi}_2$ corresponds to Bellman-consistent pessimism (BCP), while $\widehat{\pi}_\infty$ is a novel generalization of the lower confidence bound (LCB) method to the linear setting. We show that the novel $\widehat{\pi}_\infty$ learning rule is, in a sense, adaptively optimal: it achieves the minimax performance (up to log factors) against all $\ell_q$-constrained problems, and as such it strictly dominates all other predictors in the family, including $\widehat{\pi}_2$.
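To make the family concrete, below is a minimal numpy sketch under the standard linear-reward assumption $r = \phi(x,a)^\top\theta^* + \text{noise}$; the function name, the radius $\beta$, and the ridge parameter $\lambda$ are illustrative choices, not the paper's. It uses the duality identity: minimizing $\phi^\top\theta$ over the confidence set $\{\theta : \|\Sigma^{1/2}(\theta - \widehat{\theta})\|_p \le \beta\}$ gives $\phi^\top\widehat{\theta} - \beta\,\|\Sigma^{-1/2}\phi\|_q$ with $1/p + 1/q = 1$, so $p=2$ recovers the elliptical (BCP-style) bonus and $p=\infty$ yields the LCB-style rule.

```python
import numpy as np

def pessimistic_action(Phi, R, phi_candidates, p=np.inf, beta=1.0, lam=1.0):
    """Pick the action maximizing the l_p pessimistic value (illustrative sketch).

    Phi:             (n, d) features of logged (context, action) pairs
    R:               (n,)   observed rewards
    phi_candidates:  (K, d) features of the candidate actions at a test context
    """
    n, d = Phi.shape
    Sigma = Phi.T @ Phi + lam * np.eye(d)           # regularized covariance
    theta_hat = np.linalg.solve(Sigma, Phi.T @ R)   # ridge estimate

    # Sigma^{-1/2} via eigendecomposition (Sigma is symmetric positive definite).
    w, V = np.linalg.eigh(Sigma)
    Sigma_inv_half = V @ np.diag(w ** -0.5) @ V.T

    # Dual exponent q with 1/p + 1/q = 1.
    q = 1.0 if np.isinf(p) else (np.inf if p == 1.0 else p / (p - 1.0))
    bonus = np.linalg.norm(phi_candidates @ Sigma_inv_half, ord=q, axis=1)
    lcb = phi_candidates @ theta_hat - beta * bonus  # pessimistic value per action
    return int(np.argmax(lcb))
```

Setting `p=2` reproduces the elliptical-bonus index associated with BCP, while the default `p=np.inf` implements the $\widehat{\pi}_\infty$ rule highlighted in the abstract.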
Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism
Offline (or batch) reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection. Depending on the composition of the offline dataset, two main approaches are used: imitation learning, which is suited to expert datasets, and vanilla offline RL, which often requires datasets with uniform coverage. From a practical standpoint, datasets usually deviate from these two extremes, and the exact data composition is typically unknown. To bridge this gap, we present a new offline RL framework that smoothly interpolates between the two extremes of data composition, hence unifying imitation learning and vanilla offline RL. The new framework is centered around a weak version of the concentrability coefficient that measures the deviation of the behavior policy from the expert policy alone.
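As a toy illustration of the pessimism principle in the simplest instance of this setting, a multi-armed bandit, the sketch below (our naming and constants, not the paper's) selects the arm with the highest lower confidence bound computed from the offline counts:

```python
import numpy as np

def lcb_arm(counts, reward_sums, delta=0.1):
    """Choose the arm maximizing a lower confidence bound (illustrative sketch).

    counts:      (K,) times each arm appears in the offline dataset
    reward_sums: (K,) sum of observed rewards per arm
    """
    K = len(counts)
    seen = counts > 0
    means = np.divide(reward_sums, counts, out=np.zeros(K), where=seen)
    # The penalty shrinks for heavily-sampled arms, so an expert-dominated
    # dataset steers the rule toward imitating the behavior policy.
    penalty = np.sqrt(np.log(2 * K / delta) / np.maximum(counts, 1))
    lcb = np.where(seen, means - penalty, -np.inf)
    return int(np.argmax(lcb))
```

When the dataset is expert-like, the most-played arm carries the smallest penalty and the rule behaves like imitation; under uniform coverage all penalties shrink together and the rule approaches the empirical best arm, matching the interpolation the abstract describes.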
Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality
We study stochastic structured bandits for minimizing regret. The fact that the popular optimistic algorithms do not achieve the asymptotic instance-dependent regret optimality (asymptotic optimality for short) has recently drawn the attention of researchers. On the other hand, it is known that one can achieve bounded regret (i.e., regret that does not grow indefinitely with the time horizon $n$) in certain instances. Unfortunately, existing asymptotically optimal algorithms rely on forced sampling that introduces an $\omega(1)$ term in their regret with respect to $n$, failing to adapt to the ``easiness'' of the instance. In this paper, we focus on the finite hypothesis case and ask whether one can achieve asymptotic optimality while enjoying bounded regret whenever possible. We provide a positive answer by introducing a new algorithm called CRush Optimism with Pessimism (CROP) that eliminates optimistic hypotheses by pulling the informative arms indicated by a pessimistic hypothesis. Our finite-time analysis shows that CROP $(i)$ achieves constant-factor asymptotic optimality, $(ii)$ adapts to bounded regret thanks to its forced-exploration-free design, and $(iii)$ has a regret bound that scales not with the number of arms $K$ but with an effective number of arms $K_\psi$ that we introduce. We also discuss a problem class where CROP can be exponentially better than existing algorithms in \textit{nonasymptotic} regimes. This problem class reveals the surprising fact that even a clairvoyant oracle who plays according to the asymptotically optimal arm-pull scheme may suffer a linear worst-case regret.
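The toy loop below sketches the eliminate-with-pessimism idea in a finite hypothesis class; the arm-selection rule is only a crude proxy (CROP chooses informative arms by solving an optimization over arm-pull allocations, omitted here), and the confidence radius is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite hypothesis class: row h is the mean-reward vector under hypothesis h.
hypotheses = np.array([[0.9, 0.5, 0.1],
                       [0.2, 0.8, 0.1],
                       [0.2, 0.5, 0.7]])
true_means = hypotheses[1]          # ground truth, hidden from the learner
K = hypotheses.shape[1]

active = np.ones(len(hypotheses), dtype=bool)
counts, sums = np.zeros(K), np.zeros(K)
for t in range(1, 2001):
    live = np.flatnonzero(active)
    # Pessimistic hypothesis: the active hypothesis with the smallest optimal value.
    pess = live[np.argmin(hypotheses[live].max(axis=1))]
    # Pull the arm that best separates it from the remaining active hypotheses
    # (a crude stand-in for CROP's informative arms).
    sep = np.abs(hypotheses[live] - hypotheses[pess]).max(axis=0)
    arm = int(np.argmax(sep)) if sep.max() > 0 else int(np.argmax(hypotheses[pess]))
    sums[arm] += rng.normal(true_means[arm], 0.5)   # noisy reward from the true instance
    counts[arm] += 1
    # Eliminate hypotheses whose means fall outside the confidence intervals.
    seen = counts > 0
    means = np.divide(sums, counts, out=np.zeros(K), where=seen)
    radius = np.sqrt(4.0 * np.log(t + 1) / np.maximum(counts, 1))
    for h in live:
        if np.any(np.abs(hypotheses[h] - means)[seen] > radius[seen]):
            active[h] = False
    if active.sum() <= 1:
        break

print("surviving hypothesis:", np.flatnonzero(active))
```

With these gaps and noise level the true hypothesis survives with high probability, and the loop never forces pulls of arms on which no active hypothesis disagrees, loosely mirroring CROP's forced-exploration-free design.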