Reinforcement Learning
The RLR-Tree: A Reinforcement Learning Based R-Tree for Spatial Data
Gu, Tu, Feng, Kaiyu, Cong, Gao, Long, Cheng, Wang, Zheng, Wang, Sheng
Learned indices have been proposed to replace classic index structures like B-Tree with machine learning (ML) models. They require replacing both the indices and the query processing algorithms currently deployed by the databases, and such a radical departure is likely to encounter challenges and obstacles. In contrast, we propose a fundamentally different way of using ML techniques to improve on the query performance of the classic R-Tree without the need of changing its structure or query processing algorithms. Despite the success of these learned indices in improving the performance of some types of queries, they still have various limitations, e.g., they can only handle spatial point objects and limited types of spatial queries, some only return approximate query results, and they either cannot handle updates or need a periodic rebuild to retain high query efficiency (detailed discussions are in Section 2). These limitations, together with the requirement that the learned indices need a replacement of the index structures and the ...
Learning Human Rewards by Inferring Their Latent Intelligence Levels in Multi-Agent Games: A Theory-of-Mind Approach with Application to Driving Data
Tian, Ran, Tomizuka, Masayoshi, Sun, Liting
Reward function, as an incentive representation that recognizes humans' agency and rationalizes humans' actions, is particularly appealing for modeling human behavior in human-robot interaction. Inverse Reinforcement Learning is an effective way to retrieve reward functions from demonstrations. However, it has always been challenging when applying it to multi-agent settings since the mutual influence between agents has to be appropriately modeled. To tackle this challenge, previous work either exploits equilibrium solution concepts by assuming humans as perfectly rational optimizers with unbounded intelligence or pre-assigns humans' interaction strategies a priori. In this work, we advocate that humans are bounded rational and have different intelligence levels when reasoning about others' decision-making process, and such an inherent and latent characteristic should be accounted for in reward learning algorithms. Hence, we exploit such insights from Theory-of-Mind and propose a new multi-agent Inverse Reinforcement Learning framework that reasons about humans' latent intelligence levels during learning. We validate our approach in both zero-sum and general-sum games with synthetic agents and illustrate a practical application to learning human drivers' reward functions from real driving data. We compare our approach with two baseline algorithms. The results show that by reasoning about humans' latent intelligence levels, the proposed approach has more flexibility and capability to retrieve reward functions that explain humans' driving behaviors better.
Greedy Multi-step Off-Policy Reinforcement Learning
Wang, Yuhui, He, Pengcheng, Tan, Xiaoyang
Multi-step off-policy reinforcement learning has achieved great success. However, existing multi-step methods usually impose a fixed prior on the bootstrap steps, while the off-policy methods often require additional correction, suffering from certain undesired effects. In this paper, we propose a novel bootstrapping method, which greedily takes the maximum value among the bootstrapping values with varying steps. The new method has two desired properties: 1) it can flexibly adjust the bootstrap step based on the quality of the data and the learned value function; 2) it can safely and robustly utilize data from an arbitrary behavior policy without additional correction, whatever its quality or "off-policyness". We analyze the theoretical properties of the related operator, showing that it is able to converge to the global optimal value function, at a rate faster than the traditional Bellman Optimality Operator. Furthermore, based on this new operator, we derive new model-free RL algorithms named Greedy Multi-Step Q Learning (and Greedy Multi-step DQN). Experiments reveal that the proposed methods are reliable, easy to implement, and achieve state-of-the-art performance on a series of standard benchmark datasets.
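The greedy bootstrapping idea described in this abstract can be illustrated with a minimal sketch. It assumes the target is the maximum over n of the n-step discounted return plus the discounted greedy value of the state reached after n steps. The function name, argument layout, and the handling of terminal states (the caller passes 0.0 as the value of a terminal state) are assumptions made for illustration, not the authors' implementation.

```python
from typing import Optional, Sequence


def greedy_multistep_target(
    rewards: Sequence[float],
    next_state_values: Sequence[float],
    gamma: float,
    max_steps: Optional[int] = None,
) -> float:
    """Return the largest n-step bootstrapped value along one trajectory.

    rewards[i] is the reward r_{t+i}.
    next_state_values[i] is max_a Q(s_{t+i+1}, a), assumed to be supplied by
    the caller (0.0 for a terminal state).
    """
    if len(rewards) != len(next_state_values):
        raise ValueError("rewards and next_state_values must have equal length")
    horizon = len(rewards) if max_steps is None else min(max_steps, len(rewards))
    if horizon < 1:
        raise ValueError("at least one transition is required")

    best = float("-inf")
    discounted_return = 0.0  # sum of gamma^i * r_{t+i} over the first n steps
    discount = 1.0           # gamma^n after n steps
    for i in range(horizon):
        discounted_return += discount * rewards[i]
        discount *= gamma
        # Candidate target that bootstraps after n = i + 1 steps.
        candidate = discounted_return + discount * next_state_values[i]
        best = max(best, candidate)
    return best
```

With `max_steps=1` the function reduces to the ordinary one-step Q-learning target, so the choice of step count is made per sample by the maximum rather than fixed in advance.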
A New Artificial Intelligence Makes Mistakes--on Purpose - AI Summary
The AI chess program, known as Maia, uses the kind of cutting-edge AI behind the best superhuman chess-playing programs. Alpha Zero broke from conventional AI chess programs by having computers learn, independent of any human instruction, how to play the game expertly. For chess, Alpha Zero is fed board positions and moves generated in practice games, and it tunes its neurons' firing to favor winning moves, an approach known as reinforcement learning. The result is a chess program capable of playing in a more human way. Well before that, AI that can predict and mimic human behavior could have immediate applications in chess and other games.
Machine learning offers fresh approach to tackling SQL injection vulnerabilities
A new machine learning technique could make it easier for penetration testers to find SQL injection exploits in web applications. Introduced in a recently published paper by researchers at the University of Oslo, the method uses reinforcement learning to automate the process of exploiting a known SQL injection vulnerability. While the technique comes with quite a few caveats and assumptions, it provides a promising path toward developing machine learning models that can assist in penetration testing and security assessment tasks. Reinforcement learning is a branch of machine learning in which an AI model is given the possible actions and rewards of an environment and is left to find the best ways to apply those actions to maximize the reward. "It's inevitable that AI and machine learning are also applied in offensive security," Laszlo Erdodi, lead author of the paper and postdoctoral fellow at the department of informatics at the University of Oslo, told The Daily Swig.
Learning Collision-free and Torque-limited Robot Trajectories based on Alternative Safe Behaviors
Kiemel, Jonas C., Kröger, Torsten
This paper presents an approach to learn online generation of collision-free and torque-limited trajectories for industrial robots. A neural network, which is trained via reinforcement learning, is periodically invoked to predict future motions. For each robot joint, the network outputs the kinematic state that is desired at the end of the current time interval. Compliance with kinematic joint limits is ensured by the design of the action space. Given the current kinematic state and the network prediction, a trajectory for the current time interval can be computed. The main idea of our paper is to execute the predicted motion only if a collision-free and torque-limited way to continue the trajectory is known. In practice, the predicted motion is expanded by a braking trajectory and simulated using a physics engine. If the simulated trajectory complies with all safety constraints, the predicted motion is carried out. Otherwise, the braking trajectory calculated in the previous decision step serves as an alternative safe behavior. For evaluation, up to three simulated robots are trained to reach as many randomly placed target points as possible. We show that our method reliably prevents collisions with static obstacles and collisions between the robots, while generating motions that respect both torque limits and kinematic joint limits. Experiments with a real robot demonstrate that safe trajectories can be generated in real-time.
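The decision rule described in this abstract (execute the predicted motion only if it can be followed by a safe braking trajectory, otherwise fall back to the previously computed braking trajectory) might be sketched as follows. This is a simplified illustration under stated assumptions: a trajectory is a list of positions for a single joint, `is_safe` is a hypothetical stand-in for the physics-engine check, and consuming the stored braking trajectory one interval at a time is an assumption about bookkeeping that the abstract does not spell out.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: one joint position per simulation step.
Trajectory = List[float]


def choose_motion(
    predicted: Trajectory,
    braking: Trajectory,
    fallback: Trajectory,
    is_safe: Callable[[Trajectory], bool],
) -> Tuple[Trajectory, Trajectory]:
    """Return (motion to execute now, braking trajectory kept as fallback).

    predicted: motion proposed by the network for the current interval.
    braking:   braking trajectory that would follow the predicted motion.
    fallback:  braking trajectory stored at the previous decision step.
    is_safe:   stand-in for simulating a trajectory against all constraints.
    """
    # The prediction is accepted only if it can be continued safely.
    if is_safe(predicted + braking):
        return predicted, braking

    # Otherwise follow the stored braking trajectory for this interval and
    # keep whatever remains of it as the fallback for the next step.
    steps = len(predicted)
    return fallback[:steps], fallback[steps:]
```

For example, with `is_safe = lambda traj: all(abs(q) <= 1.0 for q in traj)`, a prediction that would push the joint past 1.0 is rejected and the first samples of the stored braking trajectory are executed instead.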
Causal Reinforcement Learning: An Instrumental Variable Approach
Li, Jin, Luo, Ye, Zhang, Xiaowei
In the standard data analysis framework, data is first collected (once and for all), and then data analysis is carried out. With the advancement of digital technology, decision makers constantly analyze past data and generate new data through the decisions they make. In this paper, we model this as a Markov decision process and show that the dynamic interaction between data generation and data analysis leads to a new type of bias -- reinforcement bias -- that exacerbates the endogeneity problem in standard data analysis. We propose a class of instrumental variable (IV)-based reinforcement learning (RL) algorithms to correct for the bias and establish their asymptotic properties by incorporating them into a two-timescale stochastic approximation framework. A key contribution of the paper is the development of new techniques that allow for the analysis of the algorithms in general settings where noises feature time-dependency. We use the techniques to derive sharper results on finite-time trajectory stability bounds: with a polynomial rate, the entire future trajectory of the iterates from the algorithm falls within a ball that is centered at the true parameter and is shrinking at a (different) polynomial rate. We also use the techniques to provide formulas for inferences that are rarely done for RL algorithms. These formulas highlight how the strength of the IV and the degree of the noise's time dependency affect the inference.
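The endogeneity problem and the role of an instrument can be illustrated with the classical static IV estimator. This is not the paper's IV-based RL algorithm, only a sketch of the underlying idea on synthetic data. The data-generating process, the function names, and the parameter values below are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple


def simulate(n: int, beta: float, seed: int = 0) -> Tuple[List[float], List[float], List[float]]:
    """Generate (z, x, y) where x is endogenous and z is a valid instrument."""
    rng = random.Random(seed)
    zs, xs, ys = [], [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # instrument: moves x, independent of u
        u = rng.gauss(0.0, 1.0)  # unobserved confounder affecting x and y
        x = z + u
        y = beta * x + u + rng.gauss(0.0, 0.1)
        zs.append(z)
        xs.append(x)
        ys.append(y)
    return zs, xs, ys


def covariance(a: Sequence[float], b: Sequence[float]) -> float:
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Ordinary least squares slope; biased here because x correlates with u."""
    return covariance(xs, ys) / covariance(xs, xs)


def iv_slope(zs: Sequence[float], xs: Sequence[float], ys: Sequence[float]) -> float:
    """Instrumental-variable slope: cov(z, y) / cov(z, x)."""
    return covariance(zs, ys) / covariance(zs, xs)
```

Under this assumed process the OLS slope converges to beta + 0.5 because the confounder enters both x and y, while the IV slope converges to beta because z is uncorrelated with the confounder.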
Off-Belief Learning
Hu, Hengyuan, Lerer, Adam, Cui, Brandon, Pineda, Luis, Wu, David, Brown, Noam, Foerster, Jakob
The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and rely on multi-step counterfactual reasoning based on assumptions about other agents' actions and thus fail when paired with humans or independently trained agents. In contrast, no current methods can learn optimal policies that are fully grounded, i.e., do not rely on counterfactual information from observing other agents' actions. To address this, we present off-belief learning (OBL): at each time step OBL agents assume that all past actions were taken by a given, fixed policy ($\pi_0$), but that future actions will be taken by an optimal policy under these same assumptions. When $\pi_0$ is uniform random, OBL learns the optimal grounded policy. OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next. This introduces counterfactual reasoning in a controlled manner. Unlike independent RL which may converge to any equilibrium policy, OBL converges to a unique policy, making it more suitable for zero-shot coordination. OBL can be scaled to high-dimensional settings with a fictitious transition mechanism and shows strong performance in both a simple toy-setting and the benchmark human-AI/zero-shot coordination problem Hanabi.
Passing Through Narrow Gaps with Deep Reinforcement Learning
Tidd, Brendan, Cosgun, Akansel, Leitner, Jurgen, Hudson, Nicolas
The DARPA subterranean challenge requires teams of robots to traverse difficult and diverse underground environments. Traversing small gaps is one of the challenging scenarios that robots encounter. Imperfect sensor information makes this difficult for classical navigation methods, where behaviours require significant manual fine-tuning. In this paper we present a deep reinforcement learning method for autonomously navigating through small gaps, where contact between the robot and the gap may be required. We first learn a gap behaviour policy to get through small gaps (only centimeters wider than the robot). We then learn a goal-conditioned behaviour selection policy that determines when to activate the gap behaviour policy. We train our policies in simulation and demonstrate their effectiveness with a large tracked robot in simulation and on the real platform. In simulation experiments, our approach achieves a 93% success rate when the gap behaviour is activated manually by an operator, and 67% with autonomous activation using the behaviour selection policy. In real robot experiments, our approach achieves a success rate of 73% with manual activation, and 40% with autonomous behaviour selection. While we show the feasibility of our approach in simulation, the difference in performance between simulated and real world scenarios highlights the difficulty of direct sim-to-real transfer for deep reinforcement learning policies. In both the simulated and real world environments alternative methods were unable to traverse the gap.
Bayesian Meta-Learning for Few-Shot Policy Adaptation Across Robotic Platforms
Ghadirzadeh, Ali, Chen, Xi, Poklukar, Petra, Finn, Chelsea, Björkman, Mårten, Kragic, Danica
Reinforcement learning methods can achieve significant performance but require a large amount of training data collected on the same robotic platform. A policy trained with expensive data is rendered useless after making even a minor change to the robot hardware. In this paper, we address the challenging problem of adapting a policy, trained to perform a task, to a novel robotic hardware platform given only a few demonstrations of robot motion trajectories on the target robot. We formulate it as a few-shot meta-learning problem where the goal is to find a meta-model that captures the common structure shared across different robotic platforms such that data-efficient adaptation can be performed. We achieve such adaptation by introducing a learning framework consisting of a probabilistic gradient-based meta-learning algorithm that models the uncertainty arising from the few-shot setting with a low-dimensional latent variable. We experimentally evaluate our framework on a simulated reaching task and a real-robot picking task using 400 simulated robots generated by varying the physical parameters of an existing set of robotic platforms. Our results show that the proposed method can successfully adapt a trained policy to different robotic platforms with novel physical parameters, and demonstrate the superiority of our meta-learning algorithm compared to state-of-the-art methods for the introduced few-shot policy adaptation problem.