AITopics

We propose a model-based lifelong reinforcement-learning approach that estimates a hierarchical Bayesian posterior distilling the common structure shared across different tasks. The learned posterior combined with a sample-based Bayesian exploration procedure increases the sample efficiency of learning across a family of related tasks. We first derive an analysis of the relationship between the sample complexity and the initialization quality of the posterior in the finite MDP setting. We next scale the approach to continuous-state domains by introducing a Variational Bayesian Lifelong Reinforcement Learning algorithm that can be combined with recent model-based deep RL methods, and that exhibits backward transfer. Experimental results on several challenging domains show that our algorithms achieve both better forward and backward transfer performance than state-of-the-art lifelong RL methods.

machine learning, posterior, reinforcement learning, (19 more...)

2210.11579

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Transferring Dexterous Manipulation from GPU Simulation to a Remote Real-World TriFinger

Allshire, Arthur, Mittal, Mayank, Lodaya, Varun, Makoviychuk, Viktor, Makoviichuk, Denys, Widmaier, Felix, Wüthrich, Manuel, Bauer, Stefan, Handa, Ankur, Garg, Animesh

We present a system for learning a challenging dexterous manipulation task: moving a cube to an arbitrary 6-DoF pose with only three fingers, trained with NVIDIA's IsaacGym simulator. We show empirical benefits, both in simulation and in sim-to-real transfer, of using keypoints rather than position+quaternion representations of the 6-DoF object pose in policy observations and in reward calculation when training a model-free reinforcement learning agent. By combining domain randomization strategies with the keypoint representation of the manipulated object's pose, we achieve a high success rate of 83% on a remote TriFinger system maintained by the organizers of the Real Robot Challenge. To assist further research in learning in-hand manipulation, we make the codebase of our system, along with trained checkpoints that come with billions of steps of experience, available at https://s2r2-ig.github.io
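To make the keypoint idea concrete, here is a minimal sketch of one way to turn a position+quaternion pose into keypoints and compare two poses. It is not the authors' implementation: the choice of the eight cube corners as keypoints, the `(w, x, y, z)` quaternion order, the 0.065 m side length, and all function names are assumptions made for illustration.

```python
import math
from itertools import product


def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (nested lists)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalize defensively
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def cube_keypoints(position, quaternion, side=0.065):
    """Return the 8 cube corners in world coordinates (side length is assumed)."""
    half = side / 2.0
    rot = quat_to_rotmat(quaternion)
    keypoints = []
    for corner in product((-half, half), repeat=3):
        # world = R @ corner + position
        world = tuple(
            position[i] + sum(rot[i][j] * corner[j] for j in range(3))
            for i in range(3)
        )
        keypoints.append(world)
    return keypoints


def mean_keypoint_distance(pose_a, pose_b, side=0.065):
    """Average Euclidean distance between corresponding keypoints of two poses."""
    kps_a = cube_keypoints(*pose_a, side=side)
    kps_b = cube_keypoints(*pose_b, side=side)
    return sum(math.dist(a, b) for a, b in zip(kps_a, kps_b)) / len(kps_a)


if __name__ == "__main__":
    current = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0))
    goal = ((0.1, 0.0, 0.0), (math.sqrt(0.5), 0.0, 0.0, math.sqrt(0.5)))
    print(mean_keypoint_distance(current, goal))
```

A single scalar such as `mean_keypoint_distance` captures translation and rotation error in the same units, which is one plausible reason a keypoint representation is convenient for reward calculation.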

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2108.09779

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)

Genre: Research Report (0.52)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots > Manipulation (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

How can a Radar Mask its Cognition?

Pattanayak, Kunal, Krishnamurthy, Vikram, Berry, Christopher

A cognitive radar is a constrained utility maximizer that adapts its sensing mode in response to a changing environment. If an adversary can estimate the utility function of a cognitive radar, it can determine the radar's sensing strategy and degrade the radar's performance via electronic countermeasures (ECM). This paper discusses how a cognitive radar can hide its strategy from an adversary that detects cognition. The radar does so by transmitting purposefully designed sub-optimal responses to spoof the adversary's Neyman-Pearson detector. We provide theoretical guarantees by ensuring the Type-I error probability of the adversary's detector exceeds a pre-defined level for a specified tolerance on the radar's performance loss. We illustrate our cognition masking scheme via numerical examples involving waveform adaptation and beam allocation. We show that small purposeful deviations from the optimal strategy of the radar confuse the adversary by significant amounts, thereby masking the radar's cognition. Our approach uses novel ideas from revealed preference in microeconomics and adversarial inverse reinforcement learning. Our proposed algorithms provide a principled approach for system-level electronic counter-countermeasures (ECCM) to mask the radar's cognition, i.e., hide the radar's strategy from an adversary. We also provide performance bounds for our cognition masking scheme when the adversary has misspecified measurements of the radar's response.

machine learning, radar, reinforcement learning, (19 more...)

2210.11444

Country:

North America > United States > New York > Tompkins County > Ithaca (0.04)
North America > United States > New Jersey > Camden County > Cherry Hill (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.86)
(2 more...)

Co-Training an Observer and an Evading Target

Brandenburger, André, Hoffmann, Folker, Charlish, Alexander

Reinforcement learning (RL) is already widely applied in areas such as robotics, but it is only sparsely used in sensor management. In this paper, we apply the popular Proximal Policy Optimization (PPO) approach to a multi-agent UAV tracking scenario. While recorded data of real scenarios can accurately reflect the real world, the required amount of data is not always available. Simulation data, by contrast, is typically cheap to generate, but the target behavior used is often naive and only vaguely represents the real world. In this paper, we use multi-agent RL to jointly generate protagonistic and antagonistic policies and overcome the data generation problem, as the policies are generated on the fly and adapt continuously. This way, we are able to clearly outperform baseline methods and robustly generate competitive policies. In addition, we investigate explainable artificial intelligence (XAI) by interpreting feature saliency and generating an easy-to-read decision tree as a simplified policy.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

doi: 10.23919/FUSION49465.2021.9627024

2210.11126

Country:

Europe > Sweden > Stockholm > Stockholm (0.04)
Oceania > Australia > Queensland > Brisbane (0.04)
North America > United States > Massachusetts (0.04)
(6 more...)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.90)
(2 more...)

Algorithm for Constrained Markov Decision Process with Linear Convergence

Gladin, Egor, Lavrik-Karmazin, Maksim, Zainullina, Karina, Rudenko, Varvara, Gasnikov, Alexander, Takáč, Martin

We consider the constrained Markov decision process problem. An agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its costs (the number of constraints is relatively small). A new dual approach is proposed that integrates two ingredients: an entropy-regularized policy optimizer and Vaidya's dual optimizer, both of which are critical to achieving faster convergence. A finite-time error bound for the proposed approach is provided. Despite the challenge of a nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge (at a linear rate) to the global optimum. The complexity, expressed in terms of the optimality gap and the constraint violation, significantly improves upon that of existing primal-dual approaches.
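For orientation, the problem described above can be written in the standard constrained-MDP form shown below. This is a generic textbook formulation, not the paper's exact statement: the symbols $\pi$ (policy), $\gamma$ (discount factor), $r$ (reward), $c_i$ (cost functions), $b_i$ (cost budgets), $m$ (number of constraints), and $\lambda$ (dual variables) are notational assumptions, and the entropy regularization mentioned in the abstract is omitted.

```latex
\begin{aligned}
\max_{\pi} \quad & V_r(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big] \\
\text{s.t.} \quad & V_{c_i}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\Big] \le b_i,
\qquad i = 1, \dots, m.
\end{aligned}

% Generic Lagrangian dual over the multipliers \lambda \in \mathbb{R}^m_{\ge 0}:
\min_{\lambda \ge 0} \; \max_{\pi} \;
V_r(\pi) - \sum_{i=1}^{m} \lambda_i \big( V_{c_i}(\pi) - b_i \big)
```

Because the outer minimization runs over only $m$ multipliers, a small number of constraints keeps the dual problem low-dimensional.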

algorithm, artificial intelligence, machine learning, (14 more...)

2206.01666

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > New York (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.85)

Learning Preferences for Interactive Autonomy

Bıyık, Erdem

When robots enter everyday human environments, they need to understand their tasks and how they should perform those tasks. To encode these, reward functions, which specify the objective of a robot, are employed. However, designing reward functions can be extremely challenging for complex tasks and environments. A promising approach is to learn reward functions from humans. Recently, several robot learning works have embraced this approach and leveraged human demonstrations to learn the reward functions. Known as inverse reinforcement learning, this approach relies on a fundamental assumption: humans can provide near-optimal demonstrations to the robot. Unfortunately, this is rarely the case: human demonstrations to the robot are often suboptimal for various reasons, e.g., the difficulty of teleoperation, the robot having many degrees of freedom, or humans' cognitive limitations. This thesis is an attempt towards learning reward functions from human users by using other, more reliable data modalities. Specifically, we study how reward functions can be learned using comparative feedback, in which the human user compares multiple robot trajectories instead of (or in addition to) providing demonstrations. To this end, we first propose various forms of comparative feedback, e.g., pairwise comparisons, best-of-many choices, rankings, and scaled comparisons, and describe how a robot can use these various forms of human feedback to infer a reward function, which may be parametric or non-parametric. Next, we propose active learning techniques that enable the robot to ask for comparison feedback that optimizes the expected information gained from that user feedback. Finally, we demonstrate the applicability of our methods in a wide variety of domains, ranging from autonomous driving simulations to home robotics, and from standard reinforcement learning benchmarks to lower-body exoskeletons.
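As an illustration of inferring a reward function from pairwise comparisons, the sketch below fits a linear reward over hand-made trajectory features with a logistic (Bradley-Terry style) choice model. This is a common modeling choice rather than a method confirmed by the abstract: the linear reward, the choice model, the feature vectors, and the function names are all assumptions.

```python
import math


def preference_prob(w, feat_a, feat_b):
    """P(user prefers trajectory A over B) under a logistic choice model.

    Each trajectory is summarized by a feature vector, and its reward is
    assumed to be the dot product w . features.
    """
    diff = sum(wi * (a - b) for wi, a, b in zip(w, feat_a, feat_b))
    return 1.0 / (1.0 + math.exp(-diff))


def fit_reward_weights(comparisons, dim, lr=0.5, steps=500):
    """Gradient ascent on the log-likelihood of the observed comparisons.

    comparisons: list of (features_of_preferred, features_of_rejected).
    """
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for preferred, rejected in comparisons:
            p = preference_prob(w, preferred, rejected)
            for i in range(dim):
                # d/dw log sigmoid(w . d) = (1 - sigmoid(w . d)) * d
                grad[i] += (1.0 - p) * (preferred[i] - rejected[i])
        w = [wi + lr * gi / len(comparisons) for wi, gi in zip(w, grad)]
    return w


if __name__ == "__main__":
    # Hypothetical features: [speed, clearance from obstacles].
    comparisons = [
        ([0.2, 0.9], [0.8, 0.1]),
        ([0.5, 0.7], [0.5, 0.2]),
        ([0.9, 0.8], [0.1, 0.3]),
    ]
    weights = fit_reward_weights(comparisons, dim=2)
    print(weights)
```

In this toy data the user always prefers the trajectory with more clearance, so the fitted weight on the second feature comes out positive. An active-learning variant would choose which pair to show next, which this sketch does not attempt.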

artificial intelligence, bayesian inference, machine learning, (21 more...)

2210.10899

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
North America > United States > Massachusetts > Plymouth County > Hanover (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > Middle East > Republic of Türkiye (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Transportation > Ground > Road (1.00)
Health & Medicine (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
(5 more...)

DIAMBRA Arena: a New Reinforcement Learning Platform for Research and Experimentation

Palmas, Alessandro

Recent advances in reinforcement learning have led to effective methods able to achieve above-human-level performance in very complex environments. However, once solved, these environments become less valuable, and new challenges with different or more complex scenarios are needed to support research advances. This work presents DIAMBRA Arena, a new platform for reinforcement learning research and experimentation, featuring a collection of high-quality environments that expose a Python API fully compliant with the OpenAI Gym standard. The environments are episodic tasks with discrete actions and observations composed of raw pixels plus additional numerical values. All of them support both single-player and two-player modes, making it possible to work on standard reinforcement learning, competitive multi-agent settings, human-agent competition, self-play, human-in-the-loop training, and imitation learning. Software capabilities are demonstrated by successfully training multiple deep reinforcement learning agents with proximal policy optimization, obtaining human-like behavior. Results confirm the utility of DIAMBRA Arena as a reinforcement learning research tool, providing environments designed to study some of the most challenging topics in the field.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2210.10595

Country:

North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre:

Overview (0.67)
Research Report > New Finding (0.34)

Industry:

Leisure & Entertainment > Games > Computer Games (1.00)
Information Technology (0.68)
Education (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Policy Optimization with Linear Temporal Logic Constraints

Voloshin, Cameron, Le, Hoang M., Chaudhuri, Swarat, Yue, Yisong

We study the problem of policy optimization (PO) with linear temporal logic (LTL) constraints. The language of LTL allows flexible description of tasks that may be unnatural to encode as a scalar cost function. We consider LTL-constrained PO as a systematic framework, decoupling task specification from policy selection, and as an alternative to the standard of cost shaping. With access to a generative model, we develop a model-based approach that enjoys a sample complexity analysis for guaranteeing both task satisfaction and cost optimality (through a reduction to a reachability problem). Empirically, our algorithm can achieve strong performance even in low-sample regimes.

amec, artificial intelligence, machine learning, (18 more...)

2206.09546

Country:

North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > Michigan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre:

Research Report (0.63)
Overview (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Hierarchical Reinforcement Learning for Furniture Layout in Virtual Indoor Scenes

Di, Xinhan, Yu, Pengqian

In real life, decorating 3D indoor scenes by designing furniture layouts provides a rich experience for people. In this paper, we explore the furniture layout task as a Markov decision process (MDP) in virtual reality, which is solved by hierarchical reinforcement learning (HRL). The goal is to produce a proper two-furniture layout in virtual-reality indoor scenes. In particular, we first design a simulation environment and introduce the HRL formulation for a two-furniture layout. We then apply a hierarchical actor-critic algorithm with curriculum learning to solve the MDP. We conduct our experiments on a large-scale real-world interior layout dataset that contains industrial designs from professional designers. Our numerical results demonstrate that the proposed model yields higher-quality layouts than state-of-the-art models.

furniture, machine learning, reinforcement learning, (14 more...)

2210.10431

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > Singapore (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Leisure & Entertainment > Games (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.35)

Deep neural network expressivity for optimal stopping problems

Gonon, Lukas

This article studies deep neural network expression rates for optimal stopping problems of discrete-time Markov processes on high-dimensional state spaces. A general framework is established in which the value function and continuation value of an optimal stopping problem can be approximated with error at most $\varepsilon$ by a deep ReLU neural network of size at most $\kappa d^{\mathfrak{q}} \varepsilon^{-\mathfrak{r}}$. The constants $\kappa,\mathfrak{q},\mathfrak{r} \geq 0$ do not depend on the dimension $d$ of the state space or the approximation accuracy $\varepsilon$. This proves that deep neural networks do not suffer from the curse of dimensionality when employed to solve optimal stopping problems. The framework covers, for example, exponential Lévy models, discrete diffusion processes and their running minima and maxima. These results mathematically justify the use of deep neural networks for numerically solving optimal stopping problems and pricing American options in high dimensions.

artificial intelligence, assumption 1, machine learning, (19 more...)

2210.10443

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report (0.70)

Industry: Banking & Finance (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)