
 Reisinger, Christoph


Efficient Learning for Entropy-Regularized Markov Decision Processes via Multilevel Monte Carlo

arXiv.org Machine Learning

Designing efficient learning algorithms with complexity guarantees for Markov decision processes (MDPs) with large or continuous state and action spaces remains a fundamental challenge. We address this challenge for entropy-regularized MDPs with Polish state and action spaces, assuming access to a generative model of the environment. We propose a novel family of multilevel Monte Carlo (MLMC) algorithms that integrate fixed-point iteration with MLMC techniques and a generic stochastic approximation of the Bellman operator. We quantify the precise impact of the chosen approximate Bellman operator on the accuracy of the resulting MLMC estimator. Leveraging this error analysis, we show that using a biased plain MC estimate for the Bellman operator results in quasi-polynomial sample complexity, whereas an unbiased randomized multilevel approximation of the Bellman operator achieves polynomial sample complexity in expectation. Notably, these complexity bounds are independent of the dimensions or cardinalities of the state and action spaces, distinguishing our approach from existing algorithms whose complexities scale with the sizes of these spaces. We validate these theoretical performance guarantees through numerical experiments.
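The abstract above builds on the standard MLMC telescoping identity. As a rough illustration of that structure only (not the paper's specific algorithm), the sketch below shows a generic MLMC estimator in Python; the callback `sample_level_pair`, the sample budgets, and the seeding are hypothetical placeholders, and in the paper's setting the level-l quantity would be a fixed-point iterate obtained from a stochastic approximation of the entropy-regularized Bellman operator.

```python
import numpy as np

def mlmc_estimate(sample_level_pair, num_levels, samples_per_level, seed=0):
    """Generic MLMC telescoping estimator of E[P_L]:
        E[P_L] = E[P_0] + sum_{l=1}^{L} E[P_l - P_{l-1}].
    `sample_level_pair(l, rng)` returns one coupled draw (P_l, P_{l-1}),
    with P_{-1} taken to be 0 at the coarsest level l = 0.
    """
    rng = np.random.default_rng(seed)
    estimate = 0.0
    for level in range(num_levels + 1):
        n_l = samples_per_level[level]
        # Average of coupled level-l corrections P_l - P_{l-1}
        corrections = [np.subtract(*sample_level_pair(level, rng)) for _ in range(n_l)]
        estimate += np.mean(corrections, axis=0)
    return estimate
```

The per-level sample budgets are chosen to balance bias and variance across levels; the complexity bounds quoted above come from choosing this allocation, and the approximate Bellman operator, carefully.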


$K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control

arXiv.org Machine Learning

In reinforcement learning (RL), off-policy evaluation (OPE) deals with the problem of estimating the value of a target policy with observations generated from a different behavior policy. OPE methods are typically applied to sequential decision-making problems where observational data is available but direct experimentation with the environment is impossible or costly. More broadly, OPE methods are a widely researched subject in RL (see, e.g., [60, 23, 58] for recent overviews); however, relatively little attention has been paid to stochastic environments where the stochasticity depends on the chosen actions and where the state and action spaces are continuous. For example, common benchmark problems are either deterministic or have finite state and/or action spaces (see, e.g., [60, 23]). Notwithstanding this, stochastic control problems are precisely concerned with the setting where a decision process affects random transitions. Stochastic control is a field closely related to reinforcement learning, and its methods have been applied to a wide range of high-stakes decision-making problems in diverse fields such as operations research [24, 41], economics [31, 29], electrical engineering [44, 17], autonomous driving [62] and finance [15, 55]. In the stochastic control literature, optimal policies are often represented as deterministic feedback policies (i.e., as deterministic functions of the current state) and, in the episodic case, are non-stationary due to the impact of a finite time horizon. Stochastic control environments pose a challenging setting for OPE methods. For example, classical methods like importance sampling (IS) [50] struggle with deterministic target policies in continuous action spaces due to the severe policy mismatch between the target and the behavior policy (see, e.g.
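To make the resampling idea concrete, here is a minimal, hypothetical sketch of K-nearest-neighbor resampling for OPE with a deterministic target policy: at each step, the next transition is drawn from the K observed transitions whose (state, action) pair is closest to the current state paired with the target policy's action. All names and the Euclidean metric are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def knn_resample_episode(dataset, target_policy, horizon, k=5, rng=None):
    """One K-NN resampled episode for off-policy evaluation (illustrative sketch).

    dataset: list of observed transitions (state, action, reward, next_state),
             with states and actions as 1-D numpy arrays.
    target_policy: deterministic feedback policy, state -> action.
    """
    rng = rng or np.random.default_rng()
    sa = np.array([np.concatenate([s, a]) for s, a, _, _ in dataset])
    state = dataset[rng.integers(len(dataset))][0]    # start from an observed state
    total_reward = 0.0
    for _ in range(horizon):
        query = np.concatenate([state, target_policy(state)])
        dists = np.linalg.norm(sa - query, axis=1)
        idx = rng.choice(np.argsort(dists)[:k])        # pick one of the K nearest transitions
        _, _, reward, next_state = dataset[idx]
        total_reward += reward
        state = next_state
    return total_reward
```

Averaging the returns of many such resampled episodes gives the value estimate; in the episodic, non-stationary setting one would additionally match on the time index.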


Convergence of policy gradient methods for finite-horizon stochastic linear-quadratic control problems

arXiv.org Artificial Intelligence

We study the global linear convergence of policy gradient (PG) methods for finite-horizon continuous-time exploratory linear-quadratic control (LQC) problems. The setting includes stochastic LQC problems with indefinite costs and allows additional entropy regularisers in the objective. We consider a continuous-time Gaussian policy whose mean is linear in the state variable and whose covariance is state-independent. In contrast to discrete-time problems, the cost is noncoercive in the policy and not all descent directions lead to bounded iterates. We propose geometry-aware gradient descents for the mean and covariance of the policy using the Fisher geometry and the Bures-Wasserstein geometry, respectively. The policy iterates are shown to satisfy an a priori bound, and converge globally to the optimal policy at a linear rate. We further propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis, and achieves a robust linear convergence across different action frequencies. A numerical experiment confirms the convergence and robustness of the proposed algorithm.
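As a loose illustration of what "geometry-aware" updates can look like for a Gaussian policy (a generic sketch under simplifying assumptions, not the paper's exact scheme): the mean is updated with a Fisher-preconditioned (natural-gradient) step, and the covariance with a Bures-Wasserstein-style retraction that preserves positive semi-definiteness by construction.

```python
import numpy as np

def geometry_aware_pg_step(mean, cov, grad_mean, grad_cov, lr):
    """Hypothetical geometry-aware update for a Gaussian policy N(mean, cov).

    - Mean: Fisher-preconditioned step; for a Gaussian with covariance `cov`,
      the Fisher information of the mean is cov^{-1}, so the natural gradient
      is cov @ grad_mean.
    - Covariance: Bures-Wasserstein-style retraction (I - lr*G) cov (I - lr*G)^T,
      a congruence transform that keeps the iterate symmetric positive
      semi-definite.
    `grad_mean` and `grad_cov` are ordinary (Euclidean) gradients of the cost.
    """
    new_mean = mean - lr * cov @ grad_mean
    step = np.eye(cov.shape[0]) - lr * grad_cov
    new_cov = step @ cov @ step.T
    return new_mean, new_cov
```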


Linear convergence of a policy gradient method for some finite horizon continuous time control problems

arXiv.org Artificial Intelligence

Despite the popularity of policy gradient methods in the reinforcement learning community, a provably convergent policy gradient method for continuous space-time control problems with nonlinear state dynamics has been elusive. This paper proposes proximal gradient algorithms for feedback controls of finite-time-horizon stochastic control problems. The state dynamics are nonlinear diffusions with control-affine drift, and the cost functions are nonconvex in the state and nonsmooth in the control. The system noise can degenerate, which allows for deterministic control problems as special cases. We prove under suitable conditions that the algorithm converges linearly to a stationary point of the control problem and is stable with respect to policy updates by approximate gradient steps. The convergence result justifies the recent reinforcement learning heuristics that adding entropy regularization or a fictitious discount factor to the optimization objective accelerates the convergence of policy gradient methods.
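The proximal gradient template referred to above splits the cost into a smooth part, handled by a gradient step, and a nonsmooth part, handled by its proximal operator. The sketch below is one generic update on a discretized feedback control; the L1 penalty (whose prox is soft-thresholding) is an assumed stand-in for the paper's nonsmooth control cost, not its exact scheme.

```python
import numpy as np

def proximal_gradient_step(control, grad_smooth, step_size, reg_weight):
    """One generic proximal gradient update for a feedback control stored on a
    time-state grid (array `control`).

    `grad_smooth` is the gradient of the smooth part of the cost (same shape as
    `control`); the nonsmooth part is illustrated by an L1 penalty, whose prox
    is soft-thresholding.
    """
    z = control - step_size * grad_smooth
    # Soft-thresholding: prox of step_size * reg_weight * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - step_size * reg_weight, 0.0)
```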


Arbitrage-free neural-SDE market models

arXiv.org Machine Learning

Modelling the joint dynamics of liquid vanilla options is crucial for arbitrage-free pricing of illiquid derivatives and managing the risks of option trade books. This paper develops a nonparametric model for the European options book that respects underlying financial constraints while being practically implementable. We derive a state space for prices which are free from static (or model-independent) arbitrage and study the inference problem where a model is learnt from discrete time series data of stock and option prices. We use neural networks as function approximators for the drift and diffusion of the modelled SDE system, and impose constraints on the neural nets such that no-arbitrage conditions are preserved. In particular, we give methods to calibrate neural SDE models which are guaranteed to satisfy a set of linear inequalities. We validate our approach with numerical experiments using data generated from a Heston stochastic local volatility model.
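For readers unfamiliar with the setup, the following is a minimal, hypothetical Euler-Maruyama simulator for an SDE whose drift and diffusion are given by neural networks; it illustrates only the "neural SDE" ingredient and does not reproduce the paper's no-arbitrage constraints or calibration procedure. The network callables and state layout are assumptions.

```python
import numpy as np

def simulate_neural_sde(drift_net, diffusion_net, x0, dt, n_steps, rng=None):
    """Euler-Maruyama simulation of dX_t = mu(X_t) dt + sigma(X_t) dW_t,
    where `drift_net` maps the state to the drift vector and `diffusion_net`
    maps the state to the diffusion matrix (here the state would collect the
    stock price and the option-price factors).
    """
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        dw = rng.normal(scale=np.sqrt(dt), size=x.shape)  # Brownian increment
        x = x + drift_net(x) * dt + diffusion_net(x) @ dw
        path.append(x.copy())
    return np.array(path)
```

In the paper's construction, the drift and diffusion networks would additionally be constrained so that simulated prices remain in the statically arbitrage-free state space described by the linear inequalities mentioned above.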


Understanding Deep Architectures with Reasoning Layer

arXiv.org Machine Learning

Recently, there has been a surge of interest in combining deep learning models with reasoning in order to handle more sophisticated learning tasks. In many cases, a reasoning task can be solved by an iterative algorithm. Such an algorithm is often unrolled and used as a specialized layer in the deep architecture, which can be trained end-to-end with other neural components. Although such hybrid deep architectures have led to many empirical successes, the theoretical foundation of these architectures, especially the interplay between algorithm layers and other neural layers, remains largely unexplored. In this paper, we take an initial step towards an understanding of such hybrid deep architectures by showing that properties of the algorithm layers, such as convergence, stability, and sensitivity, are intimately related to the approximation and generalization abilities of the end-to-end model. Furthermore, our analysis matches our experimental observations closely under various conditions, suggesting that our theory can provide useful guidelines for designing deep architectures with reasoning layers.
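To illustrate what an unrolled "reasoning layer" is, here is a small hypothetical PyTorch module that unrolls a fixed number of gradient-descent steps on an inner least-squares problem; because every step is an ordinary differentiable operation, the layer can be trained end-to-end with surrounding neural layers. The inner objective and all names are illustrative assumptions, not the paper's construction.

```python
import torch
import torch.nn as nn

class UnrolledGDLayer(nn.Module):
    """Reasoning layer that unrolls `n_steps` gradient-descent iterations on the
    inner problem min_z 0.5 * ||A z - x||^2, with a learnable matrix A."""

    def __init__(self, dim, n_steps=10, step_size=0.1):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
        self.n_steps = n_steps
        self.step_size = step_size

    def forward(self, x):
        # x: (batch, dim); z is the iterate of the inner solver.
        z = torch.zeros_like(x)
        for _ in range(self.n_steps):
            residual = z @ self.A.T - x       # A z - x, row-wise
            grad = residual @ self.A          # A^T (A z - x): gradient of the inner objective
            z = z - self.step_size * grad
        return z
```

Properties of the unrolled solver, such as its convergence rate and stability, then directly shape the behavior of the end-to-end model, which is the kind of interplay the paper analyzes.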