Kirschner, Johannes
Confidence Estimation via Sequential Likelihood Mixing
Kirschner, Johannes, Krause, Andreas, Meziu, Michele, Mutny, Mojmir
We present a universal framework for constructing confidence sets based on sequential likelihood mixing. Building upon classical results from sequential analysis, we provide a unifying perspective on several recent lines of work, and establish fundamental connections between sequential mixing, Bayesian inference and regret inequalities from online estimation. The framework applies to any realizable family of likelihood functions and allows for non-i.i.d. data and anytime validity. Moreover, the framework seamlessly integrates standard approximate inference techniques, such as variational inference and sampling-based methods, and extends to misspecified model classes, while preserving provable coverage guarantees. We illustrate the power of the framework by deriving tighter confidence sequences for classical settings, including sequential linear regression and sparse estimation, with simplified proofs.
Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off
Zhang, Zichen, Kirschner, Johannes, Zhang, Junxi, Zanini, Francesco, Ayoub, Alex, Dehghan, Masood, Schuurmans, Dale
A default assumption in reinforcement learning (RL) and optimal control is that observations arrive at discrete time points on a fixed clock cycle. Yet, many applications involve continuous-time systems where the time discretization, in principle, can be managed. The impact of time discretization on RL methods has not been fully characterized in existing theory, but a more detailed analysis of its effect could reveal opportunities for improving data-efficiency. We address this gap by analyzing Monte-Carlo policy evaluation for LQR systems and uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently to time discretization, leading to an optimal choice of temporal resolution for a given data budget. These findings show that managing the temporal resolution can provably improve policy evaluation efficiency in LQR systems with finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and standard RL benchmarks for non-linear continuous control.
Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications
Kirschner, Johannes, Lattimore, Tor, Krause, Andreas
Partial monitoring is an expressive framework for sequential decision-making with an abundance of applications, including graph-structured and dueling bandits, dynamic pricing and transductive feedback models. We survey and extend recent results on the linear formulation of partial monitoring that naturally generalizes the standard linear bandit setting. The main result is that a single algorithm, information-directed sampling (IDS), is (nearly) worst-case rate optimal in all finite-action games. We present a simple and unified analysis of stochastic partial monitoring, and further extend the model to the contextual and kernelized setting.
Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning
Tkachuk, Volodymyr, Bakhtiari, Seyed Alireza, Kirschner, Johannes, Jusup, Matej, Bogunovic, Ilija, Szepesvári, Csaba
A practical challenge in reinforcement learning are combinatorial action spaces that make planning computationally demanding. For example, in cooperative multi-agent reinforcement learning, a potentially large number of agents jointly optimize a global reward function, which leads to a combinatorial blow-up in the action space by the number of agents. As a minimal requirement, we assume access to an argmax oracle that allows to efficiently compute the greedy policy for any Q-function in the model class. Building on recent work in planning with local access to a simulator and linear function approximation, we propose efficient algorithms for this setting that lead to polynomial compute and query complexity in all relevant problem parameters. For the special case where the feature decomposition is additive, we further improve the bounds and extend the results to the kernelized setting with an efficient algorithm.
Near-optimal Policy Identification in Active Reinforcement Learning
Li, Xiang, Mehta, Viraj, Kirschner, Johannes, Char, Ian, Neiswanger, Willie, Schneider, Jeff, Krause, Andreas, Bogunovic, Ilija
Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a generative model. We propose the AE-LSVI algorithm for bestpolicy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy uniformly over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required. Reinforcement learning (RL) algorithms are increasingly applied to complex domains such as robotics (Kober et al., 2013), magnetic tokamaks (Seo et al., 2021; Degrave et al., 2022), and molecular search (Simm et al., 2020a;b). A central challenge in such environments is that data acquisition is often a time-consuming and expensive process, or may be infeasible due to safety considerations.
Bias-Robust Bayesian Optimization via Dueling Bandits
Kirschner, Johannes, Krause, Andreas
We consider Bayesian optimization in settings where observations can be adversarially biased, for example by an uncontrolled hidden confounder. Our first contribution is a reduction of the confounded setting to the dueling bandit model. Then we propose a novel approach for dueling bandits based on information-directed sampling (IDS). Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees. Our analysis further generalizes a previously proposed semi-parametric linear bandit model to non-linear reward functions, and uncovers interesting links to doubly-robust estimation.
Asymptotically Optimal Information-Directed Sampling
Kirschner, Johannes, Lattimore, Tor, Vernade, Claire, Szepesvári, Csaba
We introduce a computationally efficient algorithm for finite stochastic linear bandits. The approach is based on the frequentist information-directed sampling (IDS) framework, with an information gain potential that is derived directly from the asymptotic regret lower bound. We establish frequentist regret bounds, which show that the proposed algorithm is both asymptotically optimal and worst-case rate optimal in finite time. Our analysis sheds light on how IDS trades off regret and information to incrementally solve the semi-infinite concave program that defines the optimal asymptotic regret. Along the way, we uncover interesting connections towards a recently proposed two-player game approach and the Bayesian IDS algorithm.
Stochastic Bandits with Context Distributions
Kirschner, Johannes, Krause, Andreas
We introduce a novel stochastic contextual bandit model, where at each step the adversary chooses a distribution over a context set. The learner observes only the context distribution while the exact context realization remains hidden. This allows for a broader range of applications, for instance when the context itself is based on predictions. By leveraging the UCB algorithm to this setting, we propose an algorithm that achieves a $\tilde{\mathcal{O}}(d\sqrt{T})$ high-probability regret bound for linearly parametrized reward functions. Our results strictly generalize previous work in the sense that both our model and the algorithm reduce to the standard setting when the environment chooses only Dirac delta distributions and therefore provides the exact context to the learner. We further obtain similar results for a variant where the learner observes the realized context after choosing the action, and we extend the results to the kernelized setting. Finally, we demonstrate the proposed method on synthetic and real-world datasets.
Adaptive and Safe Bayesian Optimization in High Dimensions via One-Dimensional Subspaces
Kirschner, Johannes, Mutný, Mojmír, Hiller, Nicole, Ischebeck, Rasmus, Krause, Andreas
Bayesian optimization is known to be difficult to scale to high dimensions, because the acquisition step requires solving a non-convex optimization problem in the same search space. In order to scale the method and keep its benefits, we propose an algorithm (LineBO) that restricts the problem to a sequence of iteratively chosen one-dimensional sub-problems. We show that our algorithm converges globally and obtains a fast local rate when the function is strongly convex. Further, if the objective has an invariant subspace, our method automatically adapts to the effective dimension without changing the algorithm. Our method scales well to high dimensions and makes use of a global Gaussian process model. When combined with the SafeOpt algorithm to solve the sub-problems, we obtain the first safe Bayesian optimization algorithm with theoretical guarantees applicable in high-dimensional settings. We evaluate our method on multiple synthetic benchmarks, where we obtain competitive performance. Further, we deploy our algorithm to optimize the beam intensity of the Swiss Free Electron Laser with up to 40 parameters while satisfying safe operation constraints.
Information-Directed Exploration for Deep Reinforcement Learning
Nikolov, Nikolay, Kirschner, Johannes, Berkenkamp, Felix, Krause, Andreas
Efficient exploration remains a major challenge for reinforcement learning. One reason is that the variability of the returns often depends on the current state and action, and is therefore heteroscedastic. Classical exploration strategies such as upper confidence bound algorithms and Thompson sampling fail to appropriately account for heteroscedasticity, even in the bandit setting. Motivated by recent findings that address this issue in bandits, we propose to use Information-Directed Sampling (IDS) for exploration in reinforcement learning. As our main contribution, we build on recent advances in distributional reinforcement learning and propose a novel, tractable approximation of IDS for deep Q-learning. The resulting exploration strategy explicitly accounts for both parametric uncertainty and heteroscedastic observation noise. We evaluate our method on Atari games and demonstrate a significant improvement over alternative approaches.