A., Prashanth L.
Risk-sensitive Bandits: Arm Mixture Optimality and Regret-efficient Algorithms
Tatlı, Meltem, Mukherjee, Arpan, A., Prashanth L., Shanmugam, Karthikeyan, Tajer, Ali
This paper introduces a general framework for risk-sensitive bandits that integrates various risk-sensitive objectives by adopting a rich class of distortion riskmetrics. The introduced framework subsumes the various existing risk-sensitive models. An important and hitherto unknown observation is that, for a wide range of riskmetrics, the optimal bandit policy involves selecting a mixture of arms. This is in sharp contrast to the convention in multi-armed bandit algorithms that there is generally a solitary arm that maximizes the utility, whether purely reward-centric or risk-sensitive. This creates a major departure from the principles for designing bandit algorithms, since there are uncountably many mixture possibilities. The contributions of the paper are as follows: (i) it formalizes a general framework for risk-sensitive bandits, (ii) it identifies standard risk-sensitive bandit models for which solitary arm selection is not optimal, and (iii) it designs regret-efficient algorithms whose sampling strategies can accurately track the optimal arm mixtures (when a mixture is optimal) or the solitary arms (when a solitary arm is optimal). The algorithms are shown to achieve a regret that scales according to $O((\log T/T)^{\nu})$, where $T$ is the horizon and $\nu>0$ is a riskmetric-specific constant.
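To make the arm-mixture phenomenon concrete, the toy sketch below (with hypothetical arm distributions, support, and distortion function; it is not the paper's algorithm) evaluates a distortion riskmetric over all mixtures of two arms on a weight grid. With the concave distortion chosen here, an interior mixture weight beats both pure arms.

```python
# Toy illustration (not the paper's algorithm): the distortion riskmetric
# rho_g(X) = int_0^inf g(P(X > x)) dx, evaluated for every mixture of two
# hypothetical arms over a grid of mixture weights.
import numpy as np

def distortion_riskmetric(values, probs, g):
    """rho_g for a discrete non-negative random variable with the given support/probabilities."""
    order = np.argsort(values)
    v, p = np.asarray(values, float)[order], np.asarray(probs, float)[order]
    tail = 1.0 - np.cumsum(p) + p                  # tail[i] = P(X >= v[i])
    prev = np.concatenate(([0.0], v[:-1]))
    return float(np.sum((v - prev) * g(tail)))

support = np.array([0.0, 0.5, 1.0])                # common reward support (assumption)
arm1 = np.array([0.0, 1.0, 0.0])                   # a "safe" arm: reward 0.5 surely
arm2 = np.array([0.6, 0.0, 0.4])                   # a "risky" arm: reward 0 or 1
g = np.sqrt                                        # example concave distortion (assumption)

weights = np.linspace(0.0, 1.0, 101)
scores = [distortion_riskmetric(support, w * arm1 + (1 - w) * arm2, g) for w in weights]
w_star = weights[int(np.argmax(scores))]
print(f"optimal weight on arm 1: {w_star:.2f} (interior, so a mixture beats both pure arms)")
```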
VaR and CVaR Estimation in a Markov Cost Process: Lower and Upper Bounds
Bhat, Sanjay, A., Prashanth L., Thoppe, Gugan
We tackle the problem of estimating the Value-at-Risk (VaR) and the Conditional Value-at-Risk (CVaR) of the infinite-horizon discounted cost within a Markov cost process. First, we derive a minimax lower bound of $\Omega(1/\sqrt{n})$ that holds both in an expected and in a probabilistic sense. Then, using a finite-horizon truncation scheme, we derive an upper bound for the error in CVaR estimation, which matches our lower bound up to constant factors. Finally, we discuss an extension of our estimation scheme that covers more general risk measures satisfying a certain continuity criterion, e.g., spectral risk measures and utility-based shortfall risk. To the best of our knowledge, our work is the first to provide lower and upper bounds on the estimation error for any risk measure within Markovian settings. We remark that our lower bounds also extend to the mean of the infinite-horizon discounted cost. Even in that case, our result of $\Omega(1/\sqrt{n})$ improves upon the existing result of $\Omega(1/n)$ [13].
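As a rough illustration of the shape of such an estimation scheme (not the paper's analysis), the sketch below simulates an assumed toy two-state Markov cost process, truncates each trajectory at a finite horizon $m$, and applies empirical quantile and tail-average estimators of VaR and CVaR to the $n$ sampled discounted costs; the chain, costs, and constants are all assumptions.

```python
# Minimal sketch: truncated discounted costs from a toy Markov cost process,
# followed by empirical VaR/CVaR estimators on the sampled costs.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.3, 0.7]])   # hypothetical 2-state transition matrix
c = np.array([0.0, 1.0])                 # per-state cost
gamma, m, n, alpha = 0.9, 100, 2000, 0.95

def truncated_discounted_cost(s0=0):
    s, total = s0, 0.0
    for t in range(m):
        total += (gamma ** t) * c[s]
        s = rng.choice(2, p=P[s])
    return total

costs = np.array([truncated_discounted_cost() for _ in range(n)])
var_hat = np.quantile(costs, alpha)               # empirical VaR_alpha (quantile estimator)
cvar_hat = costs[costs >= var_hat].mean()         # empirical CVaR_alpha (tail average)
print(f"VaR ~ {var_hat:.3f}, CVaR ~ {cvar_hat:.3f}")
```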
Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation
Patil, Gandharv, A., Prashanth L., Nagaraj, Dheeraj, Precup, Doina
We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From our analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.
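A minimal sketch of tail-averaged TD(0) with linear function approximation on a hypothetical random-walk reward process follows; the constant step size and the choice to average the last half of the iterates are illustrative assumptions rather than the paper's exact prescription.

```python
# Sketch: TD(0) with linear features, followed by tail-averaging of the iterates.
import numpy as np

rng = np.random.default_rng(1)
n_states, d, gamma, alpha, T = 5, 3, 0.9, 0.02, 20_000
phi = rng.standard_normal((n_states, d))            # fixed random feature matrix
theta = np.zeros(d)
iterates = []

s = 0
for t in range(T):
    s_next = (s + rng.choice([-1, 1])) % n_states   # toy random-walk Markov chain
    r = 1.0 if s_next == 0 else 0.0                 # toy reward on visiting state 0
    td_error = r + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta = theta + alpha * td_error * phi[s]       # TD(0) update with linear features
    iterates.append(theta.copy())
    s = s_next

theta_tail = np.mean(iterates[T // 2:], axis=0)     # tail-averaged iterate (last half)
print(theta_tail)
```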
A Gradient Smoothed Functional Algorithm with Truncated Cauchy Random Perturbations for Stochastic Optimization
Mondal, Akash, A., Prashanth L., Bhatnagar, Shalabh
In this paper, we present a stochastic gradient algorithm for minimizing a smooth objective function that is an expectation over noisy cost samples, where only the latter are observed for any given parameter. Our algorithm employs a gradient estimation scheme with random perturbations, which are formed using the truncated Cauchy distribution from the $\delta$-sphere. We analyze the bias and variance of the proposed gradient estimator. Our algorithm is found to be particularly useful in the case when the objective function is non-convex and the parameter dimension is high. From an asymptotic convergence analysis, we establish that our algorithm converges almost surely to the set of stationary points of the objective function, and we obtain the asymptotic convergence rate. We also show that our algorithm avoids unstable equilibria, implying convergence to local minima. Further, we perform a non-asymptotic convergence analysis of our algorithm. In particular, we establish a non-asymptotic bound for finding an $\epsilon$-stationary point of the non-convex objective function. Finally, we demonstrate numerically through simulations that our algorithm outperforms GSF, SPSA, and RDSA by a significant margin over a few non-convex settings, and we further validate its performance over convex (noisy) objectives.
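The sketch below illustrates the general shape of such a scheme: a two-point random-perturbation gradient surrogate whose direction is built from truncated Cauchy draws placed on a sphere of radius $\delta$, plugged into a stochastic gradient loop. The truncation bound, scaling, step sizes, and toy objective are assumptions, and the paper's exact estimator may differ.

```python
# Illustrative sketch (not the paper's exact estimator): a two-point gradient
# surrogate with truncated-Cauchy perturbation directions, inside plain SGD.
import numpy as np

rng = np.random.default_rng(2)

def truncated_cauchy(d, bound=5.0):
    """iid standard Cauchy coordinates, redrawn until all lie in [-bound, bound]."""
    z = rng.standard_cauchy(d)
    while np.any(np.abs(z) > bound):
        z = rng.standard_cauchy(d)
    return z

def noisy_f(x):
    return np.sum(x ** 2) + 0.01 * rng.standard_normal()   # toy noisy objective

d, delta, lr, iters = 10, 0.05, 0.1, 2000
x = rng.standard_normal(d)
for _ in range(iters):
    u = truncated_cauchy(d)
    u = u / np.linalg.norm(u)                               # direction on the unit sphere
    g_hat = (noisy_f(x + delta * u) - noisy_f(x - delta * u)) / (2 * delta) * u
    x = x - lr * g_hat                                      # stochastic gradient step
print(np.linalg.norm(x))                                    # much smaller than the starting norm
```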
A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning
Maniyar, Mizhaan Prajit, Mondal, Akash, A., Prashanth L., Bhatnagar, Shalabh
We consider the problem of control in the setting of reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular solution approach for this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function using sample trajectories. The first algorithm requires an exact solution of the cubic regularized problem in each iteration, while the second algorithm employs an efficient gradient descent-based approximation to the cubic regularized problem. We establish convergence of our proposed algorithms to a second-order stationary point (SOSP) of the value function, which results in the avoidance of traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $\epsilon$-SOSP is $O(\epsilon^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(\epsilon^{-4.5})$.
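Each iteration of such a method must (approximately) solve a cubic-regularized model. The sketch below shows that subproblem with a plain gradient-descent inner loop, using stand-in values for the gradient and Hessian estimates (in the paper these come from likelihood-ratio estimates over sample trajectories); the constants are illustrative assumptions.

```python
# Sketch of the cubic-regularized Newton subproblem, solved approximately by
# gradient descent: minimize  g^T s + 0.5 s^T H s + (M/6) ||s||^3  over s.
import numpy as np

def cubic_subproblem_gd(g, H, M, steps=500, eta=0.01):
    s = np.zeros_like(g)
    for _ in range(steps):
        grad = g + H @ s + 0.5 * M * np.linalg.norm(s) * s   # gradient of the cubic model
        s = s - eta * grad
    return s

rng = np.random.default_rng(3)
d = 5
g = rng.standard_normal(d)            # stand-in gradient estimate
A = rng.standard_normal((d, d))
H = 0.5 * (A + A.T)                   # stand-in (possibly indefinite) Hessian estimate
step = cubic_subproblem_gd(g, H, M=10.0)
print(step)
```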
A Survey of Risk-Aware Multi-Armed Bandits
Tan, Vincent Y. F., A., Prashanth L., Jagannathan, Krishna
In many applications, the mean of the reward distribution does not satisfactorily capture the merits of a drug or a portfolio. In such applications, risk plays a crucial role, and a risk-aware performance measure is preferable, so as to capture losses in the case of adverse events. This survey aims to consolidate and summarise the existing research on risk measures, specifically in the context of multi-armed bandits. We review various risk measures of interest and comment on their properties. Next, we review existing concentration inequalities for various risk measures. Then, we proceed to defining risk-aware bandit problems and consider algorithms for the regret minimization and best-arm identification settings.
Online Estimation and Optimization of Utility-Based Shortfall Risk
Menon, Arvind S., A., Prashanth L., Jagannathan, Krishna
In several financial applications, it is necessary to understand risk sensitivity while maximizing returns. Several risk measures have been studied in the literature, e.g., mean-variance, Value at Risk (VaR), Conditional Value at Risk (CVaR), distorted risk measures, and prospect theory. In [2], the authors consider four properties as desirable for a risk measure, namely positive homogeneity, translation invariance, sub-additivity, and monotonicity. They define a risk measure as being coherent if it possesses the aforementioned properties. In a related development, in [19], the authors chose to relax the sub-additivity and positive homogeneity requirements of a coherent risk measure, and instead impose a convexity condition on the underlying risk measure.
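For the utility-based shortfall risk (UBSR) itself, a sample-based estimate can be computed by a one-dimensional root search. The sketch below assumes the convention that the UBSR of a loss $X$ is the root in $t$ of $E[\ell(X-t)]=\lambda$ for a nondecreasing loss $\ell$, and uses an exponential loss and Gaussian samples purely for illustration.

```python
# Minimal sketch: sample-based UBSR estimate via bisection on the empirical average.
import numpy as np

def ubsr_estimate(samples, ell, lam, lo=-20.0, hi=20.0, tol=1e-6):
    """Bisection on t for (1/n) sum ell(x_i - t) = lam (the average is nonincreasing in t)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.mean(ell(samples - mid)) > lam:
            lo = mid      # empirical average too large -> increase t
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(4)
losses = rng.normal(loc=1.0, scale=2.0, size=10_000)   # hypothetical loss samples
ell = lambda u: np.exp(u)                               # exponential loss (assumption)
print(ubsr_estimate(losses, ell, lam=1.0))              # ~ mu + sigma^2/2 = 3 for this setup
```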
(Bandit) Convex Optimization with Biased Noisy Gradient Oracles
Hu, Xiaowei, A., Prashanth L., György, András, Szepesvári, Csaba
Algorithms for bandit convex optimization and online learning often rely on constructing noisy gradient estimates, which are then used in appropriately adjusted first-order algorithms, replacing actual gradients. Depending on the properties of the function to be optimized and the nature of ``noise'' in the bandit feedback, the bias and variance of gradient estimates exhibit various tradeoffs. In this paper, we propose a novel framework that replaces the specific gradient estimation methods with an abstract oracle. With the help of the new framework, we unify previous works, reproducing their results in a clean and concise fashion, while, perhaps more importantly, the framework also allows us to formally show that to achieve the optimal root-$n$ rate, either the algorithms that use existing gradient estimators or the proof techniques used to analyze them have to go beyond what exists today.
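One concrete oracle that fits this abstract framework is the classical two-point sphere-smoothing estimator sketched below (an illustration, not the paper's construction): shrinking the perturbation size $\delta$ reduces the smoothing bias but inflates the effect of evaluation noise through the $1/\delta$ factor, which is the kind of bias-variance tradeoff being formalized. The test function and noise level are assumptions.

```python
# Sketch: two-point sphere-smoothing gradient oracle from noisy function values,
# and its accuracy at a fixed point for several perturbation sizes delta.
import numpy as np

rng = np.random.default_rng(5)
d, sigma = 10, 0.1
noisy_f = lambda x: 0.25 * np.sum(x ** 4) + sigma * rng.standard_normal()  # bandit feedback

def gradient_oracle(x, delta):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                       # uniform direction on the unit sphere
    return d * (noisy_f(x + delta * u) - noisy_f(x - delta * u)) / (2 * delta) * u

x0 = np.ones(d)                                  # true gradient here is x0**3 = all ones
for delta in [1.0, 0.1, 0.01]:
    avg = np.mean([gradient_oracle(x0, delta) for _ in range(20_000)], axis=0)
    print(delta, np.round(np.linalg.norm(avg - x0 ** 3), 3))   # large delta: bias; small delta: noise
```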
Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint
A., Prashanth L., Fu, Michael
The classic objective in a reinforcement learning (RL) problem is to find a policy that minimizes, in expectation, a long-run objective such as the infinite-horizon discounted or long-run average cost. In many practical applications, optimizing the expected value alone is not sufficient, and it may be necessary to include a risk measure in the optimization process, either as the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., mean-variance tradeoff, exponential utility, percentile performance, value at risk, conditional value at risk, prospect theory and its later enhancement, cumulative prospect theory. In this article, we focus on the combination of risk criteria and reinforcement learning in a constrained optimization framework, i.e., a setting where the goal is to find a policy that optimizes the usual objective of infinite-horizon discounted/average cost, while ensuring that an explicit risk constraint is satisfied. We introduce the risk-constrained RL framework, cover popular risk measures based on variance, conditional value-at-risk and cumulative prospect theory, and present a template for a risk-sensitive RL algorithm. We survey some of our recent work on this topic, covering problems encompassing discounted cost, average cost, and stochastic shortest path settings, together with the aforementioned risk measures in a constrained framework. This non-exhaustive survey is aimed at giving a flavor of the challenges involved in solving a risk-sensitive RL problem, and outlining some potential future research directions.
Concentration bounds for empirical conditional value-at-risk: The unbounded case
Kolla, Ravi Kumar, A., Prashanth L., Bhat, Sanjay P., Jagannathan, Krishna
In several real-world applications involving decision making under uncertainty, the traditional expected value objective may not be suitable, as it may be necessary to control losses in the case of a rare but extreme event. Conditional Value-at-Risk (CVaR) is a popular risk measure for modeling the aforementioned objective. We consider the problem of estimating CVaR from i.i.d. samples of an unbounded random variable, which is either sub-Gaussian or sub-exponential. We derive a novel one-sided concentration bound for a natural sample-based CVaR estimator in this setting. Our bound relies on a concentration result for a quantile-based estimator for Value-at-Risk (VaR), which may be of independent interest.
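For reference, the natural sample-based estimators in question are sketched below on hypothetical Gaussian (hence sub-Gaussian) samples: VaR as an empirical quantile and CVaR via the exceedance form $\widehat{v} + \frac{1}{n(1-\alpha)}\sum_i (X_i - \widehat{v})^{+}$.

```python
# Minimal sketch: quantile-based VaR estimator and the corresponding CVaR estimator.
import numpy as np

def var_cvar_estimates(x, alpha):
    var_hat = np.quantile(x, alpha)                                    # empirical VaR_alpha
    cvar_hat = var_hat + np.mean(np.maximum(x - var_hat, 0.0)) / (1 - alpha)
    return var_hat, cvar_hat

rng = np.random.default_rng(6)
samples = rng.normal(size=50_000)          # hypothetical sub-Gaussian (Gaussian) losses
print(var_cvar_estimates(samples, alpha=0.95))
```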