Reinforcement Learning
Action and Perception as Divergence Minimization
Hafner, Danijar, Ortega, Pedro A., Ba, Jimmy, Parr, Thomas, Friston, Karl, Heess, Nicolas
We introduce a unified objective for action and perception of intelligent agents. Extending representation learning and control, we minimize the joint divergence between the world and a target distribution. Intuitively, such agents use perception to align their beliefs with the world, and use actions to align the world with their beliefs. Minimizing the joint divergence to an expressive target maximizes the mutual information between the agent's representations and inputs, thus inferring representations that are informative of past inputs and exploring future inputs that are informative of the representations. This lets us derive intrinsic objectives, such as representation learning, information gain, empowerment, and skill discovery from minimal assumptions. Moreover, interpreting the target distribution as a latent variable model suggests expressive world models as a path toward highly adaptive agents that seek large niches in their environments, while rendering task rewards optional. The presented framework provides a common language for comparing a wide range of objectives, facilitates understanding of latent variables for decision making, and offers a recipe for designing novel objectives. We recommend deriving future agent objectives from the joint divergence to facilitate comparison, to point out the agent's target distribution, and to identify the intrinsic objective terms needed to reach that distribution.
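For readers who want the objective written out, here is one hedged way to express the joint divergence. The notation is assumed, not taken from the abstract: $x$ stands for the agent's inputs, $z$ for its latent representations, $p_\phi(x,z)$ for the actual joint distribution that the agent influences through its parameters $\phi$, and $\tau(x,z)$ for the target. Choosing the KL divergence is also an assumption, since the abstract only says "divergence". Under these assumptions, factorizing $p_\phi(x,z) = p_\phi(x)\,p_\phi(z \mid x)$ and $\tau(x,z) = \tau(z)\,\tau(x \mid z)$ gives a standard decomposition:

```latex
\begin{aligned}
\mathrm{KL}\big[\,p_\phi(x,z)\,\|\,\tau(x,z)\,\big]
&= \mathbb{E}_{p_\phi(x,z)}\big[\ln p_\phi(x,z) - \ln \tau(x,z)\big] \\
&= \mathbb{E}_{p_\phi(x)}\Big[\mathrm{KL}\big[\,p_\phi(z \mid x)\,\|\,\tau(z)\,\big]\Big]
 + \mathbb{E}_{p_\phi(x,z)}\big[-\ln \tau(x \mid z)\big]
 - \mathrm{H}\big[p_\phi(x)\big]
\end{aligned}
```

The first two terms resemble a familiar representation-learning loss (a latent regularizer plus a reconstruction term), while the last term rewards high-entropy inputs. This is only meant to illustrate how perception-like and exploration-like terms can come out of a single divergence.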
Optimality-based Analysis of XCSF Compaction in Discrete Reinforcement Learning
Bishop, Jordan T., Gallagher, Marcus
Learning classifier systems (LCSs) are population-based predictive systems that were originally envisioned as agents to act in reinforcement learning (RL) environments. These systems can suffer from population bloat and so are amenable to compaction techniques that try to strike a balance between population size and performance. A well-studied LCS architecture is XCSF, which in the RL setting acts as a Q-function approximator. We apply XCSF to a deterministic and stochastic variant of the FrozenLake8x8 environment from OpenAI Gym, with its performance compared in terms of function approximation error and policy accuracy to the optimal Q-functions and policies produced by solving the environments via dynamic programming. We then introduce a novel compaction algorithm (Greedy Niche Mass Compaction - GNMC) and study its operation on XCSF's trained populations. Results show that given a suitable parametrisation, GNMC preserves or even slightly improves function approximation error while yielding a significant reduction in population size. Reasonable preservation of policy accuracy also occurs, and we link this metric to the commonly used steps-to-goal metric in maze-like environments, illustrating how the metrics are complementary rather than competitive.
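The comparison against optimal Q-functions obtained by dynamic programming can be sketched as follows. This is a minimal illustration on a hypothetical four-state chain MDP, not the FrozenLake8x8 setup or the authors' code. The two metric functions (`q_mae`, `policy_accuracy`) are assumed definitions chosen for illustration, and the paper's exact definitions may differ.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-10, max_iters=10_000):
    """Compute the optimal Q-function of a tabular MDP by dynamic programming.

    P: array of shape (S, A, S) with P[s, a, s2] = Pr(s2 | s, a).
    R: array of shape (S, A) with expected immediate rewards.
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iters):
        V = Q.max(axis=1)              # greedy state values
        Q_new = R + gamma * (P @ V)    # Bellman optimality backup, shape (S, A)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

def chain_mdp(n_states=4):
    """Hypothetical deterministic chain: action 0 moves left, action 1 moves right.

    The last state is an absorbing goal; entering it yields reward 1.
    """
    P = np.zeros((n_states, 2, n_states))
    R = np.zeros((n_states, 2))
    goal = n_states - 1
    for s in range(n_states):
        if s == goal:
            P[s, :, s] = 1.0           # absorbing, zero reward
            continue
        P[s, 0, max(s - 1, 0)] = 1.0
        P[s, 1, s + 1] = 1.0
        if s + 1 == goal:
            R[s, 1] = 1.0
    return P, R

def q_mae(Q_approx, Q_opt):
    """Mean absolute error between an approximate and the optimal Q-function."""
    return float(np.mean(np.abs(Q_approx - Q_opt)))

def policy_accuracy(Q_approx, Q_opt, atol=1e-8):
    """Fraction of states where the approximate greedy action is an optimal action."""
    greedy = Q_approx.argmax(axis=1)
    chosen = Q_opt[np.arange(len(greedy)), greedy]
    return float(np.mean(np.isclose(chosen, Q_opt.max(axis=1), atol=atol)))

P, R = chain_mdp()
Q_opt = value_iteration(P, R, gamma=0.9)
# Stand-in for a learned approximator: the optimal Q plus small noise.
rng = np.random.default_rng(0)
Q_approx = Q_opt + rng.normal(scale=0.01, size=Q_opt.shape)
print("MAE:", q_mae(Q_approx, Q_opt))
print("policy accuracy:", policy_accuracy(Q_approx, Q_opt))
```

Counting ties in the optimal Q-function as correct keeps the accuracy metric from penalising an approximator for choosing among equally good actions.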
Learning to Infer User Hidden States for Online Sequential Advertising
Peng, Zhaoqing, Jin, Junqi, Luo, Lan, Yang, Yaodong, Luo, Rui, Wang, Jun, Zhang, Weinan, Xu, Haiyang, Xu, Miao, Yu, Chuan, Luo, Tiejian, Li, Han, Xu, Jian, Gai, Kun
To drive purchases in online advertising, advertisers have a strong interest in optimizing their sequential advertising strategy, for which both performance and interpretability matter. The lack of interpretability in existing deep reinforcement learning methods makes the strategy hard to understand, diagnose, and further optimize. In this paper, we propose the Deep Intents Sequential Advertising (DISA) method to address these issues. The key to interpretability is understanding a consumer's purchase intent, which is unobservable (a hidden state). We model this intent as a latent variable and formulate the problem as a Partially Observable Markov Decision Process (POMDP) in which the underlying intents are inferred from observable behaviors. Large-scale industrial offline and online experiments demonstrate our method's superior performance over several baselines. The inferred hidden states are analyzed, and the results prove the rationality of our inference.
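Inferring a hidden intent from observable behavior in a POMDP can be illustrated with the textbook discrete belief update (a Bayes filter). This is not the DISA model, which the abstract describes as a deep method. The two intents, the single "show ad" action, the two observations, and all probabilities below are hypothetical values chosen for illustration.

```python
import numpy as np

def belief_update(belief, T, O, action, observation):
    """One Bayes-filter step for a discrete POMDP.

    belief: shape (S,), current distribution over hidden states.
    T: shape (A, S, S), T[a, s, s2] = Pr(s2 | s, a).
    O: shape (A, S, Z), O[a, s2, z] = Pr(z | s2, a).
    """
    predicted = belief @ T[action]                      # predict next hidden state
    unnormalized = O[action][:, observation] * predicted  # weight by observation likelihood
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total

# Hypothetical model: hidden states 0 = low intent, 1 = high intent.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])      # one action: show an ad
O = np.array([[[0.8, 0.2],
               [0.3, 0.7]]])      # observations: 0 = ignore, 1 = click

belief = np.array([0.5, 0.5])
posterior = belief_update(belief, T, O, action=0, observation=1)
print("belief after a click:", posterior)
```

After a click, the belief shifts toward the high-intent state, which is the kind of interpretable quantity a policy can condition on.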
Learning to summarize from human feedback
Stiennon, Nisan, Ouyang, Long, Wu, Jeff, Ziegler, Daniel M., Lowe, Ryan, Voss, Chelsea, Radford, Alec, Amodei, Dario, Christiano, Paul
As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about: summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.
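Training a model to predict the human-preferred summary is commonly done with a logistic loss on the difference of two scalar rewards. The abstract does not state the loss, so the form below is an assumption, and the function names are hypothetical. The sketch shows only the loss, not the reward network or the reinforcement learning step.

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """Negative log-probability that the preferred summary wins.

    Equals -log(sigmoid(r_pref - r_rej)), computed in a numerically
    stable way as softplus(-(r_pref - r_rej)).
    """
    diff = reward_preferred - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def mean_preference_loss(pairs):
    """Average loss over (reward_preferred, reward_rejected) pairs."""
    return sum(preference_loss(p, r) for p, r in pairs) / len(pairs)

# Hypothetical reward scores for three human comparisons.
pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
print("mean loss:", mean_preference_loss(pairs))
```

The loss is small when the reward model scores the human-preferred summary well above the rejected one, and grows roughly linearly when the ordering is confidently wrong.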
A reinforcement learning approach to hybrid control design
Gandhi, Meet, Kundu, Atreyee, Bhatnagar, Shalabh
In this paper we design hybrid control policies for hybrid systems whose mathematical models are unknown. Our contributions are threefold. First, we propose a framework for modelling the hybrid control design problem as a single Markov Decision Process (MDP). This result facilitates the application of off-the-shelf algorithms from the Reinforcement Learning (RL) literature towards designing optimal control policies. Second, we model a set of benchmark examples of hybrid control design problems in the proposed MDP framework. Third, we adapt the recently proposed Proximal Policy Optimisation (PPO) algorithm for the hybrid action space and apply it to the above set of problems. It is observed that in each case the algorithm converges and finds the optimal policy.
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model
Li, Gen, Wei, Yuting, Chi, Yuejie, Gu, Yuantao, Chen, Yuxin
We investigate the sample efficiency of reinforcement learning in a $\gamma$-discounted infinite-horizon Markov decision process (MDP) with state space $\mathcal{S}$ and action space $\mathcal{A}$, assuming access to a generative model. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, prior results suffer from a sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ (up to some log factor). The current paper overcomes this barrier by certifying the minimax optimality of model-based reinforcement learning as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). More specifically, a perturbed model-based planning algorithm provably finds an $\varepsilon$-optimal policy with an order of $\frac{|\mathcal{S}||\mathcal{A}| }{(1-\gamma)^3\varepsilon^2}\log\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\varepsilon}$ samples for any $\varepsilon \in (0, \frac{1}{1-\gamma}]$. Along the way, we derive improved (instance-dependent) guarantees for model-based policy evaluation. To the best of our knowledge, this work provides the first minimax-optimal guarantee in a generative model that accommodates the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically impossible).
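To get a feel for the size of the gap between the two thresholds, the orders of magnitude in the abstract can be evaluated numerically. The sketch below ignores all constants and the unspecified log factors on the thresholds, so the numbers are indicative only. The function names and the example values ($|\mathcal{S}| = 100$, $|\mathcal{A}| = 10$, $\gamma = 0.99$, $\varepsilon = 0.1$) are hypothetical.

```python
import math

def prior_barrier(n_states, n_actions, gamma):
    """Order of the earlier sample size barrier: |S||A| / (1 - gamma)^2."""
    return n_states * n_actions / (1.0 - gamma) ** 2

def new_threshold(n_states, n_actions, gamma):
    """Order of the improved threshold: |S||A| / (1 - gamma)."""
    return n_states * n_actions / (1.0 - gamma)

def sample_complexity_order(n_states, n_actions, gamma, eps):
    """Order of samples for an eps-optimal policy, constants omitted:
    |S||A| / ((1 - gamma)^3 eps^2) * log(|S||A| / ((1 - gamma) eps)).
    """
    if not 0.0 < eps <= 1.0 / (1.0 - gamma):
        raise ValueError("eps must lie in (0, 1 / (1 - gamma)]")
    sa = n_states * n_actions
    horizon = 1.0 / (1.0 - gamma)
    return sa * horizon ** 3 / eps ** 2 * math.log(sa * horizon / eps)

n_states, n_actions, gamma = 100, 10, 0.99
print("prior barrier  ~", prior_barrier(n_states, n_actions, gamma))
print("new threshold  ~", new_threshold(n_states, n_actions, gamma))
print("eps = 0.1 bound ~", sample_complexity_order(n_states, n_actions, gamma, 0.1))
```

With these values the earlier barrier is on the order of $10^7$ samples and the improved threshold on the order of $10^5$, a factor of $\frac{1}{1-\gamma} = 100$.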
DeepMind Found New Approach To Create Faster RL Models
Recently, researchers from DeepMind and McGill University proposed new approaches to speed up the solution of complex reinforcement learning problems. Their main proposal is a divide-and-conquer approach to reinforcement learning (RL), combined with deep learning to scale up the agents' capabilities. For some years, reinforcement learning has provided a conceptual framework for addressing several fundamental problems. It has been used in applications such as modelling robots, simulating artificial limbs, developing self-driving cars, and playing games like poker and Go. The recent combination of reinforcement learning with deep learning has also produced several impressive achievements and is a promising approach to important sequential decision-making problems that are currently intractable.
Decentralized reinforcement learning: global decision-making via local economic transactions
Many neural network architectures that underlie various artificial intelligence systems today bear an interesting similarity to the early computers a century ago. Just as early computers were specialized circuits for specific purposes like solving linear systems or cryptanalysis, so too does the trained neural network generally function as a specialized circuit for performing a specific task, with all parameters coupled together in the same global scope. One might naturally wonder what it might take for learning systems to scale in complexity in the same way as programmed systems have. And if the history of how abstraction enabled computer science to scale gives any indication, one possible place to start would be to consider what it means to build complex learning systems at multiple levels of abstraction, where each level of learning is the emergent consequence of learning from the layer below. This post discusses our recent paper that introduces a framework for societal decision-making, a perspective on reinforcement learning through the lens of a self-organizing society of primitive agents.
PlotThread: Creating Expressive Storyline Visualizations using Reinforcement Learning
Tang, Tan, Li, Renzhong, Wu, Xinke, Liu, Shuhan, Knittel, Johannes, Koch, Steffen, Ertl, Thomas, Yu, Lingyun, Ren, Peiran, Wu, Yingcai
Storyline visualizations are an effective means to present the evolution of plots and reveal the scenic interactions among characters. However, designing storyline visualizations is difficult because users need to balance aesthetic goals against narrative constraints. Although optimization-based methods have improved significantly at producing aesthetic and legible layouts, existing (semi-)automatic methods are still limited regarding 1) efficient exploration of the storyline design space and 2) flexible customization of storyline layouts. In this work, we propose a reinforcement learning framework to train an AI agent that assists users in exploring the design space efficiently and generating well-optimized storylines. Based on the framework, we introduce PlotThread, an authoring tool that integrates a set of flexible interactions to support easy customization of storyline visualizations. To seamlessly integrate the AI agent into the authoring process, we employ a mixed-initiative approach where both the agent and designers work on the same canvas to boost the collaborative design of storylines. We evaluate the reinforcement learning model through qualitative and quantitative experiments and demonstrate the usage of PlotThread using a collection of use cases.
Solving the single-track train scheduling problem via Deep Reinforcement Learning
Agasucci, Valerio, Grani, Giorgio, Lamorgese, Leonardo
A rail company organizes its fleet to accommodate expected demands, maximizing revenue and coverage, so that the service is provided to customers as far as possible. From a practical point of view, companies have to make decisions for two different time horizons: offline and online. Offline decisions deal with routing trains in advance, so that the basic path for each train is decided; in normal conditions, these are the paths that will be followed. Such decisions are made only a few times a year, typically once every three to six months. The planned routes and schedules are usually hand-engineered according to regulation, safety measures, and demand requirements. As noted, planned routes are preferred in normal conditions, but normal conditions are rare, since disruptions occur daily in the network. A broken train, a malfunctioning switch, delays in the preparation of the train, and many other real-life problems may affect the overall network. Sometimes the delay introduced is small and the planned schedule can still be used, but on other occasions online rerouting and rescheduling have to be applied. In the literature, this online decision making is called the Train Dispatching problem (TD), a real-time variant of the Train Timetabling problem (known to be NP-hard [3]).