Reinforcement Learning
[R] A Bayesian Perspective on Q-Learning
Sounds good, looks like I'll be making another exposition then! So in terms of making interactive documents like this, you have a few options. I'll list them in order of easiest to hardest (assuming you code in Python and don't know much web dev): You will see that you can set up various toggles to run your visualizations. The one drawback is that it's not as interactive in "real time", because every time you reconfigure the parameters you have to re-run the cell to show the results. If you're interested in this approach, just add a cell block, then click on the three dots, and then click "Add a form".
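As a rough illustration of the form approach described above, here is a minimal sketch that assumes the notebook is Google Colab (the comment does not name the tool, but the "Add a form" menu entry matches it). The variable names and values are hypothetical. The `#@title` and `#@param` comments are what Colab renders as form widgets; anywhere else they are ordinary Python comments, so the cell still runs.

```python
#@title Visualization settings (hypothetical example)

# Assumption: this cell lives in a Google Colab notebook.
# Each "#@param" comment turns the assignment on its line into a form widget.
learning_rate = 0.1  #@param {type:"slider", min:0.0, max:1.0, step:0.05}
exploration = "thompson"  #@param ["greedy", "thompson"]
show_grid = True  #@param {type:"boolean"}

# The widgets only rewrite the values above; the cell must be re-run
# before anything that depends on them is recomputed.
settings = {
    "learning_rate": learning_rate,
    "exploration": exploration,
    "show_grid": show_grid,
}
print(settings)
```

This also shows the drawback mentioned in the comment: moving a slider changes the assignment, but the printed result is stale until the cell is executed again.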
Improved Worst-Case Regret Bounds for Randomized Least-Squares Value Iteration
Agrawal, Priyank, Chen, Jinglin, Jiang, Nan
This paper studies regret minimization with randomized value functions in reinforcement learning. In tabular finite-horizon Markov Decision Processes, we introduce a clipping variant of one classical Thompson Sampling (TS)-like algorithm, randomized least-squares value iteration (RLSVI). We analyze the algorithm using a novel intertwined regret decomposition. Our $\tilde{\mathrm{O}}(H^2S\sqrt{AT})$ high-probability worst-case regret bound improves the previous sharpest worst-case regret bounds for RLSVI and matches the existing state-of-the-art worst-case TS-based regret bounds.
Learning Guidance Rewards with Trajectory-space Smoothing
Gangwani, Tanmay, Zhou, Yuan, Peng, Jian
Long-term temporal credit assignment is an important challenge in deep reinforcement learning (RL). It refers to the ability of the agent to attribute actions to consequences that may occur after a long time interval. Existing policy-gradient and Q-learning algorithms typically rely on dense environmental rewards that provide rich short-term supervision and help with credit assignment. However, they struggle to solve tasks with delays between an action and the corresponding rewarding feedback. To make credit assignment easier, recent works have proposed algorithms to learn dense "guidance" rewards that could be used in place of the sparse or delayed environmental rewards. This paper is in the same vein -- starting with a surrogate RL objective that involves smoothing in the trajectory-space, we arrive at a new algorithm for learning guidance rewards. We show that the guidance rewards have an intuitive interpretation, and can be obtained without training any additional neural networks. Due to the ease of integration, we use the guidance rewards in a few popular algorithms (Q-learning, Actor-Critic, Distributional-RL) and present results in single-agent and multi-agent tasks that elucidate the benefit of our approach when the environmental rewards are sparse or delayed.
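To make the idea of a dense "guidance" reward concrete, here is a minimal tabular sketch. It is not taken from the paper and may differ from its estimator: it simply credits each (state, action) pair with the average return of the recorded trajectories in which the pair appears, so no extra network is trained. The function name and the trajectory format are assumptions made for illustration.

```python
from collections import defaultdict


def guidance_rewards(trajectories):
    """Average episodic return over the trajectories containing each pair.

    `trajectories` is a list of (steps, episode_return) tuples, where
    `steps` is a list of hashable (state, action) pairs. This format is
    an assumption for the sketch, not the paper's interface.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, episode_return in trajectories:
        # Count each pair once per trajectory, even if it was visited twice.
        for pair in set(steps):
            totals[pair] += episode_return
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Two episodes whose only reward arrives at the end (delayed feedback).
episodes = [
    ([("s0", "a"), ("s1", "b")], 1.0),
    ([("s0", "a"), ("s1", "c")], 0.0),
]
dense = guidance_rewards(episodes)
print(dense[("s0", "a")], dense[("s1", "b")], dense[("s1", "c")])
```

In this toy data the shared first step receives an intermediate value, while the two diverging second steps receive the high and low returns respectively, which gives a per-step signal where the environment provided only a final one.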
Option Hedging with Risk Averse Reinforcement Learning
Vittori, Edoardo, Trapletti, Michele, Restelli, Marcello
In this paper we show how risk-averse reinforcement learning can be used to hedge options. We apply a state-of-the-art risk-averse algorithm: Trust Region Volatility Optimization (TRVO) to a vanilla option hedging environment, considering realistic factors such as discrete time and transaction costs. Realism makes the problem twofold: the agent must both minimize volatility and contain transaction costs, these tasks usually being in competition. We use the algorithm to train a sheaf of agents each characterized by a different risk aversion, so to be able to span an efficient frontier on the volatility-p&l space. The results show that the derived hedging strategy not only outperforms the Black & Scholes delta hedge, but is also extremely robust and flexible, as it can efficiently hedge options with different characteristics and work on markets with different behaviors than what was used in training.
In this paper we focus on the option hedging problem in a realistic environment where we exploit the power of Reinforcement Learning (RL). In some sense we aim at replicating and hopefully improving, in an automatic way, the trader's experience of containing both risk and hedging costs. While there is an extensive literature on both option hedging [13] and reinforcement learning [29], there are very few works on the combined topics, the main ones being [4, 5, 11, 15], which we will analyze in Section 5. Here we implement a robust tool capable of providing the trader with a hedging signal more accurate than the delta hedge, as it is optimized in a realistic environment, with discrete time and transaction costs. We achieve this result through the use of risk-averse RL by applying TRVO [2], an algorithm capable of optimizing
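For reference, the Black & Scholes delta hedge used as the baseline in this entry can be sketched as follows. This is the standard textbook delta of a European call on a non-dividend-paying asset, not code from the paper; the function and parameter names are chosen here for illustration.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_delta(spot, strike, rate, vol, time_to_expiry):
    """Black-Scholes delta of a European call (no dividends).

    delta = N(d1), with
    d1 = (ln(S/K) + (r + vol^2 / 2) * T) / (vol * sqrt(T)).
    """
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * vol ** 2) * time_to_expiry) / (
              vol * math.sqrt(time_to_expiry))
    return norm_cdf(d1)


# A delta hedger holds `delta` units of the underlying per option sold
# and recomputes the position at every rebalancing date.
atm_delta = bs_call_delta(spot=100.0, strike=100.0, rate=0.0,
                          vol=0.2, time_to_expiry=1.0)
itm_delta = bs_call_delta(spot=200.0, strike=100.0, rate=0.0,
                          vol=0.2, time_to_expiry=1.0)
print(round(atm_delta, 4), round(itm_delta, 4))
```

The formula assumes continuous rebalancing at no cost. With discrete time and transaction costs, the two factors the entry emphasizes, following this delta exactly is no longer optimal, which is the gap the learned strategy targets.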
Goal-directed Generation of Discrete Structures with Conditional Generative Models
Mollaysa, Amina, Paige, Brooks, Kalousis, Alexandros
Despite recent advances, goal-directed generation of structured discrete data remains challenging. For problems such as program synthesis (generating source code) and materials design (generating molecules), finding examples which satisfy desired constraints or exhibit desired properties is difficult. In practice, expensive heuristic search or reinforcement learning algorithms are often employed. In this paper we investigate the use of conditional generative models which directly attack this inverse problem, by modeling the distribution of discrete structures given properties of interest. Unfortunately, maximum likelihood training of such models often fails, with samples from the generative model inadequately respecting the input properties. To address this, we introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward. We avoid high-variance score-function estimators that would otherwise be required by sampling from an approximation to the normalized rewards, allowing simple Monte Carlo estimation of model gradients. We test our methodology on two tasks: generating molecules with user-defined properties, and identifying short Python expressions which evaluate to a given target value. In both cases we find improvements over maximum likelihood estimation and other baselines.
Adaptive Discretization for Model-Based Reinforcement Learning
Sinclair, Sean R., Wang, Tianyu, Jain, Gauri, Banerjee, Siddhartha, Yu, Christina Lee
We introduce the technique of adaptive discretization to design an efficient model-based episodic reinforcement learning algorithm in large (potentially continuous) state-action spaces. Our algorithm is based on optimistic one-step value iteration extended to maintain an adaptive discretization of the space. From a theoretical perspective we provide worst-case regret bounds for our algorithm which are competitive with those of state-of-the-art model-based algorithms. Moreover, our bounds are obtained via a modular proof technique which can potentially extend to incorporate additional structure on the problem. From an implementation standpoint, our algorithm has much lower storage and computational requirements due to maintaining a more efficient partition of the state and action spaces. We illustrate this via experiments on several canonical control problems, which show that our algorithm empirically performs significantly better than fixed discretization in terms of both faster convergence and lower memory usage. Interestingly, we observe empirically that while fixed-discretization model-based algorithms vastly outperform their model-free counterparts, the two achieve comparable performance with adaptive discretization.
Towards Safe Policy Improvement for Non-Stationary MDPs
Chandak, Yash, Jordan, Scott M., Theocharous, Georgios, White, Martha, Thomas, Philip S.
Many real-world sequential decision-making problems involve critical systems with financial risks and human-life risks. While several works in the past have proposed methods that are safe for deployment, they assume that the underlying problem is stationary. However, many real-world problems of interest exhibit non-stationarity, and when stakes are high, the cost associated with a false stationarity assumption may be unacceptable. We take the first steps towards ensuring safety, with high confidence, for smoothly-varying non-stationary decision problems. Our proposed method extends a type of safe algorithm, called a Seldonian algorithm, through a synthesis of model-free reinforcement learning with time-series analysis. Safety is ensured using sequential hypothesis testing of a policy's forecasted performance, and confidence intervals are obtained using wild bootstrap.
Learning to Dispatch for Job Shop Scheduling via Deep Reinforcement Learning
Zhang, Cong, Song, Wen, Cao, Zhiguang, Zhang, Jie, Tan, Puay Siew, Xu, Chi
Priority dispatching rules (PDRs) are widely used for solving real-world job-shop scheduling problems (JSSP). However, the design of effective PDRs is a tedious task, requiring extensive specialized knowledge and often delivering limited performance. In this paper, we propose to automatically learn PDRs via an end-to-end deep reinforcement learning agent. We exploit the disjunctive graph representation of JSSP, and propose a Graph Neural Network based scheme to embed the states encountered during solving. The resulting policy network is size-agnostic, effectively enabling generalization on large-scale instances. Experiments show that the agent can learn high-quality PDRs from scratch with elementary raw features and that it performs strongly against the best existing PDRs. The learned policies also perform well on much larger instances that are unseen in training.
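To show what a hand-designed priority dispatching rule looks like, here is a minimal sketch of the classic shortest-processing-time (SPT) rule on a toy job-shop instance. It illustrates the kind of rule the paper aims to replace with a learned one and is not the paper's method; the instance format and function name are assumptions made for this example.

```python
def dispatch_spt(jobs):
    """Schedule a job-shop instance with the SPT dispatching rule.

    `jobs[j]` is the ordered list of (machine, duration) operations of
    job j. At each step, the job whose next operation is shortest is
    dispatched (ties broken by job index) and started as early as its
    job and machine allow. Returns (makespan, schedule), where schedule
    holds (job, machine, start, end) tuples.
    """
    next_op = [0] * len(jobs)     # index of each job's next operation
    job_ready = [0] * len(jobs)   # time each job's previous op finishes
    machine_ready = {}            # time each machine becomes free
    schedule = []
    while True:
        candidates = [j for j in range(len(jobs))
                      if next_op[j] < len(jobs[j])]
        if not candidates:
            break
        # The priority rule: pick the shortest pending operation.
        j = min(candidates, key=lambda k: (jobs[k][next_op[k]][1], k))
        machine, duration = jobs[j][next_op[j]]
        start = max(job_ready[j], machine_ready.get(machine, 0))
        end = start + duration
        schedule.append((j, machine, start, end))
        job_ready[j] = end
        machine_ready[machine] = end
        next_op[j] += 1
    makespan = max((end for _, _, _, end in schedule), default=0)
    return makespan, schedule


# Toy instance: two jobs, two machines.
toy_jobs = [
    [(0, 3), (1, 2)],  # job 0: machine 0 for 3, then machine 1 for 2
    [(1, 2), (0, 4)],  # job 1: machine 1 for 2, then machine 0 for 4
]
makespan, schedule = dispatch_spt(toy_jobs)
print(makespan, schedule)
```

A learned rule would replace only the `min(...)` selection with a policy that scores the candidate operations. The surrounding scheduling loop is independent of instance size, which is what makes a size-agnostic policy usable on larger instances.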
Batch Exploration with Examples for Scalable Robotic Reinforcement Learning
Chen, Annie S., Nam, HyunJi, Nair, Suraj, Finn, Chelsea
Learning from diverse offline datasets is a promising path towards learning general-purpose robotic agents. However, a core challenge in this paradigm lies in collecting large amounts of meaningful data, while not depending on a human in the loop for data collection. One way to address this challenge is through task-agnostic exploration, where an agent attempts to explore without a task-specific reward function, and collects data that can be useful for any downstream task. While these approaches have shown some promise in simple domains, they often struggle to explore the relevant regions of the state space in more challenging settings, such as vision-based robotic manipulation. This challenge stems from an objective that encourages exploring everything in a potentially vast state space. To mitigate this challenge, we propose to focus exploration on the important parts of the state space using weak human supervision. Concretely, we propose an exploration technique, Batch Exploration with Examples (BEE), that explores relevant regions of the state space, guided by a modest number of human-provided images of important states. These human-provided images only need to be collected once at the beginning of data collection and can be collected in a matter of minutes, allowing us to scalably collect diverse datasets, which can then be combined with any batch RL algorithm. We find that BEE is able to tackle challenging vision-based manipulation tasks both in simulation and on a real Franka robot, and observe that compared to task-agnostic and weakly-supervised exploration techniques, it (1) interacts more than twice as often with relevant objects, and (2) improves downstream task performance when used in conjunction with offline RL.