AITopics

2006.09497

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Sachdeva, Noveen, Su, Yi, Joachims, Thorsten

Off-policy Bandits with Deficient Support

arXiv.org Machine Learning Jun-16-2020

Learning effective contextual-bandit policies from past actions of a deployed system is highly desirable in many settings (e.g., voice assistants, recommendation, search), since it enables the reuse of large amounts of log data. State-of-the-art methods for such off-policy learning, however, are based on inverse propensity score (IPS) weighting. A key theoretical requirement of IPS weighting is that the policy that logged the data has "full support", which typically translates into requiring non-zero probability for any action in any context. Unfortunately, many real-world systems produce support-deficient data, especially when the action space is large, and we show how existing methods can fail catastrophically. To overcome this gap between theory and applications, we identify three approaches that provide various guarantees for IPS-based learning despite the inherent limitations of support-deficient data: restricting the action space, reward extrapolation, and restricting the policy space. We systematically analyze the statistical and computational properties of these three approaches, and we empirically evaluate their effectiveness. In addition to providing the first systematic analysis of support deficiency in contextual-bandit learning, we conclude with recommendations that provide practical guidance.
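The full-support requirement described in the abstract can be illustrated with a small worked example. The sketch below is not the paper's code: it computes the exact expectation of the IPS estimate for a single context, and the action names, rewards, and policies are hypothetical values chosen for illustration. When the logging policy gives zero probability to an action that the target policy uses, that action's contribution is lost and the estimate is biased.

```python
from fractions import Fraction


def expected_ips(logging, target, reward):
    """Exact expectation of the IPS estimate for one context.

    Actions with logging probability 0 never appear in the log,
    so they contribute nothing to the expectation.
    """
    total = Fraction(0)
    for action, p_log in logging.items():
        if p_log > 0:
            # P(action is logged) * importance weight * reward
            total += p_log * (target[action] / p_log) * reward[action]
    return total


def true_value(target, reward):
    """Expected reward of the target policy (the quantity IPS estimates)."""
    return sum(target[a] * reward[a] for a in target)


# Hypothetical rewards and policies over three actions.
reward = {"a": Fraction(1), "b": Fraction(0), "c": Fraction(1)}
target = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}
full = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
deficient = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)}

v = true_value(target, reward)
ips_full = expected_ips(full, target, reward)
ips_def = expected_ips(deficient, target, reward)
bias = ips_def - v

print("true value:", v)
print("IPS, full support:", ips_full)
print("IPS, deficient support:", ips_def, "bias:", bias)
```

With full support the expectation matches the true value. With the deficient logger, the bias equals minus the target-weighted reward of the unsupported action, which is the failure mode the abstract refers to.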

machine learning, reinforcement learning, support deficiency, (17 more...)


doi: 10.1145/3394486.3403139

2006.09438

Country:

North America > United States > New York > Tompkins County > Ithaca (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > India > Telangana > Hyderabad (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)

arXiv.org Machine Learning Jun-16-2020

$Q$-learning with Logarithmic Regret

Yang, Kunhe, Yang, Lin F., Du, Simon S.

Q-learning [Watkins and Dayan, 1992] is one of the most popular classes of methods for solving reinforcement learning (RL) problems. Q-learning tries to estimate the optimal state-action value function (Q-function). With a Q-function, at every state, one can greedily choose the action with the largest Q value to interact with the RL environment while achieving near-optimal expected cumulative rewards in the long run. Compared to other popular classes of methods, e.g., model-based RL, Q-learning algorithms (or more generally, model-free algorithms) often enjoy better memory and time efficiency.
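As a rough illustration of the Q-function estimate and greedy action choice described above, here is a minimal tabular sketch. It is not the algorithm analyzed in the paper: the three-state chain environment, the deterministic sweep over state-action pairs, and the values of `alpha` and `gamma` are all assumptions made for the example.

```python
def step(state, action):
    """Toy 3-state chain (hypothetical): action 1 moves right, action 0 moves left.

    Reaching state 2 yields reward 1 and ends the episode.
    """
    nxt = min(state + 1, 2) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == 2 else 0.0
    done = nxt == 2
    return nxt, reward, done


def q_learning(num_sweeps=200, alpha=0.5, gamma=0.9):
    """Apply the Q-learning update to every (state, action) pair repeatedly."""
    q = [[0.0, 0.0] for _ in range(3)]
    for _ in range(num_sweeps):
        for s in (0, 1):  # state 2 is terminal
            for a in (0, 1):
                nxt, r, done = step(s, a)
                # Bootstrapped target uses the best next-state value.
                target = r if done else r + gamma * max(q[nxt])
                q[s][a] += alpha * (target - q[s][a])
    return q


def greedy(q, state):
    """Pick the action with the largest Q value in the given state."""
    return max(range(len(q[state])), key=lambda a: q[state][a])


q = q_learning()
policy = [greedy(q, s) for s in (0, 1)]
print(policy, q)
```

The sketch stores only a table of Q values and never builds a transition model, which is the memory advantage of model-free methods that the passage mentions.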

artificial intelligence, machine learning, reinforcement learning, (14 more...)


2006.09118

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Learn to Effectively Explore in Context-Based Meta-RL

Zhang, Jin, Wang, Jianhao, Hu, Hao, Chen, Yingfeng, Fan, Changjie, Zhang, Chongjie

Meta reinforcement learning (meta-RL) provides a principled approach for fast adaptation to novel tasks by extracting prior knowledge from previous tasks. Under such settings, it is crucial for the agent to perform efficient exploration during adaptation to collect useful experiences. However, existing methods suffer from poor adaptation performance caused by inefficient exploration mechanisms, especially in sparse-reward problems. In this paper, we present a novel off-policy context-based meta-RL approach that efficiently learns a separate exploration policy to support fast adaptation, as well as a context-aware exploitation policy to maximize extrinsic return. The explorer is motivated by an information-theoretical intrinsic reward that encourages the agent to collect experiences that provide rich information about the task. Experimental results on both MuJoCo and Meta-World benchmarks show that our method significantly outperforms baselines by performing efficient exploration strategies.

deep learning, exploration, upstream oil & gas, (20 more...)

2006.0817

Genre: Research Report (0.50)

Industry:

Energy > Oil & Gas > Upstream (0.50)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Cideron, Geoffrey, Pierrot, Thomas, Perrin, Nicolas, Beguir, Karim, Sigaud, Olivier

QD-RL: Efficient Mixing of Quality and Diversity in Reinforcement Learning

We propose a novel reinforcement learning algorithm, QD-RL, that incorporates the strengths of off-policy RL algorithms into Quality-Diversity (QD) approaches. Quality-Diversity methods contribute structural biases by decoupling the search for diversity from the search for high return, resulting in efficient management of the exploration-exploitation trade-off. However, these approaches generally suffer from sample inefficiency as they call upon evolutionary techniques. QD-RL removes this limitation by relying on off-policy RL algorithms. More precisely, we train a population of off-policy deep RL agents to simultaneously maximize diversity inside the population and the return of the agents. QD-RL selects agents from the diversity-return Pareto front, resulting in stable and efficient population updates. Our experiments on the Ant-Maze environment show that QD-RL can solve challenging exploration and control problems with deceptive rewards while being more than 15 times more sample efficient than its evolutionary counterparts.
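Selection from a diversity-return Pareto front can be sketched as a plain non-dominated filter over two maximized objectives. This is only an illustration of the general idea, not the QD-RL implementation: the agent records, the field names `diversity` and `ret`, and the scores are hypothetical.

```python
def pareto_front(agents):
    """Return the agents that no other agent dominates.

    Agent b dominates agent a if b is at least as good on both
    objectives (diversity and return) and strictly better on one.
    """
    front = []
    for i, a in enumerate(agents):
        dominated = any(
            b["diversity"] >= a["diversity"]
            and b["ret"] >= a["ret"]
            and (b["diversity"] > a["diversity"] or b["ret"] > a["ret"])
            for j, b in enumerate(agents)
            if j != i
        )
        if not dominated:
            front.append(a)
    return front


# Hypothetical population with made-up scores.
population = [
    {"name": "A", "diversity": 0.9, "ret": 10.0},
    {"name": "B", "diversity": 0.5, "ret": 50.0},
    {"name": "C", "diversity": 0.4, "ret": 40.0},  # dominated by B
    {"name": "D", "diversity": 0.1, "ret": 80.0},
]

front_names = [a["name"] for a in pareto_front(population)]
print(front_names)
```

Keeping every non-dominated agent retains both highly diverse and high-return members of the population instead of collapsing the two objectives into a single weighted score.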

artificial intelligence, diversity, upstream oil & gas, (18 more...)

2006.08505

Genre: Research Report > New Finding (0.68)

Industry:

Energy > Oil & Gas > Upstream (0.49)
Leisure & Entertainment > Games (0.46)
Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Model-based Adversarial Meta-Reinforcement Learning

Lin, Zichuan, Thomas, Garrett, Yang, Guangwen, Ma, Tengyu

Meta-reinforcement learning (meta-RL) aims to learn from multiple training tasks the ability to adapt efficiently to unseen test tasks. Despite the success, existing meta-RL algorithms are known to be sensitive to the task distribution shift. When the test task distribution is different from the training task distribution, the performance may degrade significantly. To address this issue, this paper proposes Model-based Adversarial Meta-Reinforcement Learning (AdMRL), where we aim to minimize the worst-case sub-optimality gap -- the difference between the optimal return and the return that the algorithm achieves after adaptation -- across all tasks in a family of tasks, with a model-based approach. We propose a minimax objective and optimize it by alternating between learning the dynamics model on a fixed task and finding the adversarial task for the current model -- the task for which the policy induced by the model is maximally suboptimal. Assuming the family of tasks is parameterized, we derive a formula for the gradient of the suboptimality with respect to the task parameters via the implicit function theorem, and show how the gradient estimator can be efficiently implemented by the conjugate gradient method and a novel use of the REINFORCE estimator. We evaluate our approach on several continuous control benchmarks and demonstrate its efficacy in the worst-case performance over all tasks, the generalization power to out-of-distribution tasks, and in training and test time sample efficiency, over existing state-of-the-art meta-RL algorithms.
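One way to write the worst-case sub-optimality objective described above is given below. The notation is assumed for illustration and may differ from the paper's: $\psi \in \Psi$ parameterizes a task in the family, $\theta$ denotes the parameters of the learned dynamics model, $V^{\star}(\psi)$ is the optimal return on task $\psi$, and $\pi_{\theta,\psi}$ is the policy induced by the model after adaptation to task $\psi$.

```latex
\min_{\theta} \; \max_{\psi \in \Psi} \;
  \Big[ \, V^{\star}(\psi) \;-\; V^{\pi_{\theta,\psi}}(\psi) \, \Big]
```

The inner maximization corresponds to finding the adversarial task for the current model, and the outer minimization corresponds to updating the model. The abstract describes alternating between these two steps.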

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2006.08875

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Texas (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Index Selection for NoSQL Database with Deep Reinforcement Learning

Yao, Shun, Wang, Hongzhi, Yan, Yu

We propose a new approach to NoSQL database index selection. For different workloads, we select different indexes and their parameters to optimize database performance. The approach builds a deep reinforcement learning model to select an optimal index for a given fixed workload and adapts to a changing workload. Experimental results show that the Deep Reinforcement Learning Index Selection Approach (DRLISA) improves performance to varying degrees compared with traditional single index structures.

machine learning, reinforcement, reinforcement learning, (17 more...)

2006.08842

Country:

Asia > China > Heilongjiang Province > Harbin (0.05)
North America > United States > Texas > El Paso County > El Paso (0.04)
Europe > Netherlands > South Holland > Dordrecht (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Formal Verification of End-to-End Learning in Cyber-Physical Systems: Progress and Challenges

Fulton, Nathan, Hunt, Nathan, Hoang, Nghia, Das, Subhro

Autonomous systems - such as cars, planes, and trains - must come with strong safety guarantees. These systems are cyber-physical, in the sense that their safety depends crucially upon the way in which their software ("cyber") components interact with their kinetic components. Cyber-physical systems (CPS) analysis tools can verify the safety of CPS by stating correctness specifications in a formal language and then verifying - via computer-checked proof - that safety-critical software components respect these specifications. Existing approaches toward formally verifying the correctness of cyber-physical systems focus primarily on constructing formal safety proofs about classical low-dimensional models of control systems. For example, the safety of an adaptive cruise control system might be established by modeling the dynamics of two cars in terms of their positions and velocities and then proving that a control policy preserves safe separation between all cars on the road for any time horizon [15]. Researchers have employed a similar approach for ensuring the correctness of proposed FAA aircraft collision avoidance protocols [12], the European Train Control System [20], and quadcopters [21]. These proofs are typically constructed and checked using a cyber-physical systems verification tool such as Flow* [4], KeYmaera X [8], or SpaceEx [6]. CPS verification tools can provide very strong safety guarantees for cyber-physical systems, but typical techniques for using these tools rely on three assumptions that break down when applying verification techniques to real autonomous systems: 1. CPS verification techniques assume that a symbolic representation of the state of the world is known a priori. For example, formal CPS models of ground robots typically assume that the system knows the positions of all relevant obstacles, at least within some error bound [16].

constraint, logic & formal reasoning, machine learning, (17 more...)

2006.09181

Country:

North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Transportation > Passenger (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.90)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.73)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.69)

Preference-based Reinforcement Learning with Finite-Time Guarantees

Xu, Yichong, Wang, Ruosong, Yang, Lin F., Singh, Aarti, Dubrawski, Artur

Preference-based Reinforcement Learning (PbRL) replaces reward values in traditional reinforcement learning by preferences to better elicit human opinion on the target objective, especially when numerical reward values are hard to design or interpret. Despite promising results in applications, the theoretical understanding of PbRL is still in its infancy. In this paper, we present the first finite-time analysis for general PbRL problems. We first show that a unique optimal policy may not exist if preferences over trajectories are deterministic for PbRL. If preferences are stochastic, and the preference probability relates to the hidden reward values, we present algorithms for PbRL, both with and without a simulator, that are able to identify the best policy up to accuracy $\varepsilon$ with high probability. Our method explores the state space by navigating to under-explored states, and solves PbRL using a combination of dueling bandits and policy search. Experiments show the efficacy of our method when it is applied to real-world problems.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2006.0891

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > California (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning

Zhao, Mingde

Temporal-Difference (TD) learning is a standard and very successful reinforcement learning approach, at the core of both algorithms that learn the value of a given policy, as well as algorithms which learn how to improve policies. TD-learning with eligibility traces provides a way to do temporal credit assignment, i.e. decide which portion of a reward should be assigned to predecessor states that occurred at different previous times, controlled by a parameter $\lambda$. However, tuning this parameter can be time-consuming, and not tuning it can lead to inefficient learning. To improve the sample efficiency of TD-learning, we propose a meta-learning method for adjusting the eligibility trace parameter, in a state-dependent manner. The adaptation is achieved with the help of auxiliary learners that learn distributional information about the update targets online, incurring roughly the same computational complexity per step as the usual value learner. Our approach can be used both in on-policy and off-policy learning. We prove that, under some assumptions, the proposed method improves the overall quality of the update targets, by minimizing the overall target error. This method can be viewed as a plugin which can also be used to assist prediction with function approximation by meta-learning feature (observation)-based $\lambda$ online, or even in the control case to assist policy improvement. Our empirical evaluation demonstrates significant performance improvements, as well as improved robustness of the proposed algorithm to learning rate variation.
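The role of a state-dependent $\lambda$ in temporal credit assignment can be sketched with tabular TD($\lambda$) and accumulating traces. This is a minimal sketch under stated assumptions, not the proposed method: the $\lambda$ values are fixed by hand rather than meta-learned, the trace decay uses the $\lambda$ of the currently visited state (one of several conventions), and the two-step episode is hypothetical.

```python
def td_lambda_episode(values, episode, lam, alpha=0.5, gamma=1.0):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    values:  dict state -> value estimate (updated in place and returned)
    episode: list of (state, reward, next_state, done) transitions
    lam:     dict state -> lambda in [0, 1] (state-dependent, fixed here)
    """
    traces = {s: 0.0 for s in values}
    for s, r, s_next, done in episode:
        bootstrap = 0.0 if done else values[s_next]
        delta = r + gamma * bootstrap - values[s]  # TD error
        # Decay all traces by gamma * lambda(s), then mark the visited state.
        for k in traces:
            traces[k] *= gamma * lam[s]
        traces[s] += 1.0
        # Every state is updated in proportion to its trace.
        for k in values:
            values[k] += alpha * delta * traces[k]
    return values


# Hypothetical episode: 0 -> 1 -> 2 (terminal), reward 1 on the last step.
episode = [(0, 0.0, 1, False), (1, 1.0, 2, True)]

v_lam0 = td_lambda_episode({0: 0.0, 1: 0.0, 2: 0.0}, episode, {0: 0.0, 1: 0.0, 2: 0.0})
v_lam1 = td_lambda_episode({0: 0.0, 1: 0.0, 2: 0.0}, episode, {0: 1.0, 1: 1.0, 2: 1.0})
print(v_lam0)
print(v_lam1)
```

With $\lambda = 0$ only the state immediately preceding the reward is updated, while with $\lambda = 1$ the earlier state also receives credit. A state-dependent $\lambda$ lets this trade-off vary across the state space.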

machine learning, reinforcement learning, update target, (17 more...)

2006.08906

Country:

North America > Canada > Quebec > Montreal (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)

Genre: Research Report > New Finding (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)