reward augmentation
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (3 more...)
Hedging as Reward Augmentation in Probabilistic Graphical Models
We argue that hedging is an activity that human and machine agents should engage in more broadly, even when the agent's value is not necessarily in monetary units. In this paper, we propose a decision-theoretic view of hedging based on augmenting a probabilistic graphical model -- specifically a Bayesian network or an influence diagram -- with a reward. Hedging is therefore posed as a particular kind of graph manipulation, and can be viewed as analogous to control/intervention and information gathering related analysis. Effective hedging occurs when a risk-averse agent finds opportunity to balance uncertain rewards in their current situation. We illustrate the concepts with examples and counter-examples, and conduct experiments to demonstrate the properties and applicability of the proposed computational tools that enable agents to proactively identify potential hedging opportunities in real-world situations.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (5 more...)
- Banking & Finance > Trading (1.00)
- Information Technology (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
- Information Technology > Decision Support Systems (0.98)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.96)
Hedging as Reward Augmentation in Probabilistic Graphical Models
We argue that hedging is an activity that human and machine agents should engage in more broadly, even when the agent's value is not necessarily in monetary units. In this paper, we propose a decision-theoretic view of hedging based on augmenting a probabilistic graphical model -- specifically a Bayesian network or an influence diagram -- with a reward. Hedging is therefore posed as a particular kind of graph manipulation, and can be viewed as analogous to control/intervention and information gathering related analysis. Effective hedging occurs when a risk-averse agent finds opportunity to balance uncertain rewards in their current situation. We illustrate the concepts with examples and counter-examples, and conduct experiments to demonstrate the properties and applicability of the proposed computational tools that enable agents to proactively identify potential hedging opportunities in real-world situations.
Human-Agent Coordination in Games under Incomplete Information via Multi-Step Intent
Chen, Shenghui, Zhao, Ruihan, Chinchali, Sandeep, Topcu, Ufuk
Strategic coordination between autonomous agents and human partners under incomplete information can be modeled as turn-based cooperative games. We extend a turn-based game under incomplete information, the shared-control game, to allow players to take multiple actions per turn rather than a single action. The extension enables the use of multi-step intent, which we hypothesize will improve performance in long-horizon tasks. To synthesize cooperative policies for the agent in this extended game, we propose an approach featuring a memory module for a running probabilistic belief of the environment dynamics and an online planning algorithm called IntentMCTS. This algorithm strategically selects the next action by leveraging any communicated multi-step intent via reward augmentation while considering the current belief. Agent-to-agent simulations in the Gnomes at Night testbed demonstrate that IntentMCTS requires fewer steps and control switches than baseline methods. A human-agent user study corroborates these findings, showing an 18.52% higher success rate compared to the heuristic baseline and a 5.56% improvement over the single-step prior work. Participants also report lower cognitive load, frustration, and higher satisfaction with the IntentMCTS agent partner.
- North America > United States > Texas > Travis County > Austin (0.05)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- North America > United States > Ohio (0.04)
- (5 more...)
Reviews: InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations
Paper Summary: This paper focuses on using GANs for imitation learning using trajectories from an expert. The authors extend the GAIL (Generative Adversarial Imitation Learning) framework by including a term in the objective function to incorporate latent structure (similar to InfoGAN). The authors then proceed to show that using their framework, which they call InfoGAIL, they are able to learn interpretable latent structure when the expert policy has multiple modes and that in some setting this robustness allows them to outperform current methods. Paper Overview: The paper is generally well written. I appreciated that the authors first demon- started how the mechanism works on a toy 2D plane example before moving onto more complex driving simulation environment. This helped illustrate the core concepts of allowing the learned policy to be conditioned on a latent variable in a minimalistic setting before moving on to a more complex 3D driving simulation.
Reward Augmentation in Reinforcement Learning for Testing Distributed Systems
Borgarelli, Andrea, Enea, Constantin, Majumdar, Rupak, Nagendra, Srinidhi
Bugs in popular distributed protocol implementations have been the source of many downtimes in popular internet services. We describe a randomized testing approach for distributed protocol implementations based on reinforcement learning. Since the natural reward structure is very sparse, the key to successful exploration in reinforcement learning is reward augmentation. We show two different techniques that build on one another. First, we provide a decaying exploration bonus based on the discovery of new states -- the reward decays as the same state is visited multiple times. The exploration bonus captures the intuition from coverage-guided fuzzing of prioritizing new coverage points; in contrast to other schemes, we show that taking the maximum of the bonus and the Q-value leads to more effective exploration. Second, we provide waypoints to the algorithm as a sequence of predicates that capture interesting semantic scenarios. Waypoints exploit designer insight about the protocol and guide the exploration to ``interesting'' parts of the state space. Our reward structure ensures that new episodes can reliably get to deep interesting states even without execution caching. We have implemented our algorithm in Go. Our evaluation on three large benchmarks (RedisRaft, Etcd, and RSL) shows that our algorithm can significantly outperform baseline approaches in terms of coverage and bug finding.
Generative Augmented Flow Networks
Pan, Ling, Zhang, Dinghuai, Courville, Aaron, Huang, Longbo, Bengio, Yoshua
Deep reinforcement learning (RL) has achieved significant progress in recent years with particular success in games (Mnih et al., 2015, Silver et al., 2016, Vinyals et al., 2019). RL methods applied to the setting where a reward is only given at the end (i.e., terminal states) typically aim at maximizing that reward function for learning the optimal policy. However, diversity of the generated states is desirable in a wide range of practical scenarios including molecule generation (Bengio et al., 2021a), biological sequence design (Jain et al., 2022b), recommender systems (Kunaver and Požrl, 2017), dialogue systems (Zhang et al., 2020), etc. For example, in molecule generation, the reward function used in in-silico simulations can be uncertain and imperfect itself (compared to the more expensive in-vivo experiments). Therefore, it is not sufficient to only search the solution that maximizes the return. Instead, it is desired that we sample many high-reward candidates, which can be achieved by sampling them proportionally to the reward of each terminal state. Interestingly, GFlowNets (Bengio et al., 2021a,b) learn a stochastic policy to sample composite objects x X with probability proportional to the return R(x) . The learning paradigm of GFlowNets is different from other RL methods, as it is explicitly aiming at modeling the diversity in the target distribution, i.e., all the modes of the reward function. This makes it natural for practical applications where the model should discover objects that are both interesting and diverse, which is a focus of previous GFlowNet works (Bengio et al., 2021a,b, Jain et al., 2022b, Malkin et al., 2022).
Student/Teacher Advising through Reward Augmentation
Transfer learning is an important new subfield of multiagent reinforcement learning that aims to help an agent learn about a problem by using knowledge that it has gained solving another problem, or by using knowledge that is communicated to it by an agent who already knows the problem. This is useful when one wishes to change the architecture or learning algorithm of an agent (so that the new knowledge need not be built "from scratch"), when new agents are frequently introduced to the environment with no knowledge, or when an agent must adapt to similar but different problems. Great progress has been made in the agent-to-agent case using the Teacher/Student framework proposed by (Torrey and Taylor 2013). However, that approach requires that learning from a teacher be treated differently from learning in every other reinforcement learning context. In this paper, I propose a method which allows the teacher/student framework to be applied in a way that fits directly and naturally into the more general reinforcement learning framework by integrating the teacher feedback into the reward signal received by the learning agent. I show that this approach can significantly improve the rate of learning for an agent playing a one-player stochastic game; I give examples of potential pitfalls of the approach; and I propose further areas of research building on this framework.
Adaptive Stress Testing with Reward Augmentation for Autonomous Vehicle Validation
Corso, Anthony, Du, Peter, Driggs-Campbell, Katherine, Kochenderfer, Mykel J.
Determining possible failure scenarios is a critical step in the evaluation of autonomous vehicle systems. Real-world vehicle testing is commonly employed for autonomous vehicle validation, but the costs and time requirements are high. Consequently, simulation-driven methods such as Adaptive Stress Testing (AST) have been proposed to aid in validation. AST formulates the problem of finding the most likely failure scenarios as a Markov decision process, which can be solved using reinforcement learning. In practice, AST tends to find scenarios where failure is unavoidable and tends to repeatedly discover the same types of failures of a system. This work addresses these issues by encoding domain relevant information into the search procedure. With this modification, the AST method discovers a larger and more expressive subset of the failure space when compared to the original AST formulation. We show that our approach is able to identify useful failure scenarios of an autonomous vehicle policy.
- North America > United States > Illinois (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > India (0.04)