AITopics

This paper proposes a novel and scalable reinforcement learning approach for simultaneous routing and spectrum access in wireless ad-hoc networks. In most previous works on reinforcement learning for network optimization, routing and spectrum access are tackled as separate tasks; further, the wireless links in the network are assumed to be fixed, and a different agent is trained for each transmission node -- this limits scalability and generalizability. In this paper, we account for the inherent signal-to-interference-plus-noise ratio (SINR) in the physical layer and propose a more scalable approach in which a single agent is associated with each flow. Specifically, a single agent makes all routing and spectrum access decisions as it moves along the frontier nodes of each flow. The agent is trained according to the physical layer characteristics of the environment using the future bottleneck SINR as a novel reward definition. This allows a highly effective routing strategy based on the geographic locations of the nodes in the wireless ad-hoc network. The proposed deep reinforcement learning strategy is capable of accounting for the mutual interference between the links. It learns to avoid interference by intelligently allocating spectrum slots and making routing decisions for the entire network in a scalable manner.
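The reward idea in this abstract can be illustrated with a small sketch. It assumes the standard definition of SINR (desired received power divided by total interference plus noise) and reads "future bottleneck SINR" as the minimum SINR over the remaining hops of a flow. That reading, the function names, and the power values are illustrative assumptions, not details taken from the paper.

```python
import math


def sinr(signal_power, interference_powers, noise_power):
    """Linear-scale SINR: desired power over (total interference + noise)."""
    return signal_power / (sum(interference_powers) + noise_power)


def bottleneck_sinr(link_sinrs):
    """The weakest link limits the whole route, so take the minimum."""
    return min(link_sinrs)


def future_bottleneck_rewards(link_sinrs):
    """Assumed reward at hop t: minimum SINR over hops t, t+1, ..., last."""
    rewards = []
    running_min = math.inf
    for value in reversed(link_sinrs):
        running_min = min(running_min, value)
        rewards.append(running_min)
    rewards.reverse()
    return rewards


# Hypothetical three-hop flow; powers are in arbitrary linear units.
hops = [
    sinr(8.0, [1.0, 0.5], 0.5),  # 8 / 2 = 4.0
    sinr(3.0, [0.5], 0.5),       # 3 / 1 = 3.0
    sinr(10.0, [1.5], 0.5),      # 10 / 2 = 5.0
]
route_bottleneck = bottleneck_sinr(hops)        # 3.0
hop_rewards = future_bottleneck_rewards(hops)   # [3.0, 3.0, 5.0]
```

Under this reading, a decision made early in the flow is credited with the worst link that follows it, which is what would push an agent away from routes that contain a single heavily interfered hop.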

agent, frequency band, node, (17 more...)

2012.11783

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Telecommunications (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Liu, Yuejiang, Yan, Qi, Alahi, Alexandre

Social NCE: Contrastive Learning of Socially-aware Motion Representations

Learning socially-aware motion representations is at the core of recent advances in human trajectory forecasting and robot navigation in crowded spaces. Yet existing methods often struggle to generalize to challenging scenarios and even output unacceptable solutions (e.g., collisions). In this work, we propose to address this issue via contrastive learning. Concretely, we introduce a social contrastive loss that encourages the encoded motion representation to preserve sufficient information for distinguishing a positive future event from a set of negative ones. We explicitly draw these negative samples based on our domain knowledge about socially unfavorable scenarios in the multi-agent context. Experimental results show that the proposed method consistently boosts the performance of previous trajectory forecasting, behavioral cloning, and reinforcement learning algorithms in various settings. Our method makes few assumptions about neural architecture designs, and hence can be used as a generic way to incorporate negative data augmentation into motion representation learning.
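A contrastive loss of the kind described above can be sketched with the generic InfoNCE objective, which scores one positive against a set of negatives. The sketch below is not the authors' implementation: the cosine similarity, the temperature value, and the toy two-dimensional embeddings are assumptions chosen for illustration, and the abstract does not specify how the negatives are sampled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    top = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]


# Hypothetical embeddings: the query is an encoded motion history, the
# positive is an observed future, the negatives are unfavorable futures.
query = (1.0, 0.0)
positive = (0.9, 0.1)
negatives = [(0.0, 1.0), (-1.0, 0.2)]

loss_aligned = info_nce_loss(query, positive, negatives)
# Swapping in a negative as the "positive" should be penalized heavily.
loss_swapped = info_nce_loss(query, negatives[0], [positive, negatives[1]])
```

The loss is small when the query embedding sits closer to the positive than to every negative, and large otherwise, which is the property the abstract relies on to separate acceptable futures from unfavorable ones.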

arxiv, international conference, learning, (15 more...)

2012.11717

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Transportation (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.89)

Bandi, Nymisha, Tulabandhula, Theja

Off-Policy Optimization of Portfolio Allocation Policies under Constraints

The dynamic portfolio optimization problem in finance frequently requires learning policies that adhere to various constraints, driven by investor preferences and risk. We motivate this problem of finding an allocation policy within a sequential decision making framework and study the effects of: (a) using data collected under previously employed policies, which may be sub-optimal and constraint-violating, and (b) imposing desired constraints while computing near-optimal policies with this data. Our framework relies on solving a minimax objective, where one player evaluates policies via off-policy estimators, and the opponent uses an online learning strategy to control constraint violations. We extensively investigate various choices for off-policy estimation and their corresponding optimization sub-routines, and quantify their impact on computing constraint-aware allocation policies. Our study shows promising results for constructing such policies when back-tested on historical equities data, under various regimes of operation, dimensionality and constraints.

constraint, investment, portfolio, (14 more...)

2012.11715

Genre: Research Report (0.64)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Rafailov, Rafael, Yu, Tianhe, Rajeswaran, Aravind, Finn, Chelsea

Offline Reinforcement Learning from Images with Latent Space Models

Offline reinforcement learning (RL) refers to the problem of learning policies from a static dataset of environment interactions. Offline RL enables extensive use and re-use of historical datasets, while also alleviating safety concerns associated with online exploration, thereby expanding the real-world applicability of RL. Most prior work in offline RL has focused on tasks with compact state representations. However, the ability to learn directly from rich observation spaces like images is critical for real-world applications such as robotics. In this work, we build on recent advances in model-based algorithms for offline RL, and extend them to high-dimensional visual observation spaces. Model-based offline RL algorithms have achieved state-of-the-art results in state-based tasks and have strong theoretical guarantees. However, they rely crucially on the ability to quantify uncertainty in the model predictions, which is particularly challenging with image observations. To overcome this challenge, we propose to learn a latent-state dynamics model, and represent the uncertainty in the latent space. Our approach is both tractable in practice and corresponds to maximizing a lower bound of the ELBO in the unknown POMDP. In experiments on a range of challenging image-based locomotion and manipulation tasks, we find that our algorithm significantly outperforms previous offline model-free RL methods as well as state-of-the-art online visual model-based RL methods. Moreover, we also find that our approach excels on an image-based drawer closing task on a real robot using a pre-existing dataset. All results including videos can be found online at https://sites.google.com/view/lompo/ .

algorithm, arxiv preprint arxiv, dataset, (11 more...)

2012.11547

Country: North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)

Menictas, Marianne, Tomkins, Sabina, Murphy, Susan

Fast Physical Activity Suggestions: Efficient Hyperparameter Learning in Mobile Health

arXiv.org Machine Learning, Dec-21-2020

Users can be supported to adopt healthy behaviors, such as regular physical activity, via relevant and timely suggestions on their mobile devices. Recently, reinforcement learning algorithms have been found to be effective for learning the optimal context under which to provide suggestions. However, these algorithms are not necessarily designed for the constraints posed by mobile health (mHealth) settings, which require them to be efficient, domain-informed and computationally affordable. We propose an algorithm for providing physical activity suggestions in mHealth settings. Using domain science, we formulate a contextual bandit algorithm which makes use of a linear mixed effects model. We then introduce a procedure to efficiently perform hyper-parameter updating, using far fewer computational resources than competing approaches. Not only is our approach computationally efficient, it is also easily implemented with closed-form matrix algebraic updates, and we show improvements over state-of-the-art approaches in both speed and accuracy of up to 99% and 56%, respectively.

algorithm, random effect, variance component, (11 more...)

arXiv.org Machine Learning

2012.11646

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Consumer Health (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.54)
Information Technology > Data Science > Data Mining > Big Data (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)

arXiv.org Machine Learning, Dec-21-2020

SIBRE: Self Improvement Based REwards for Adaptive Feedback in Reinforcement Learning

Nath, Somjit, Verma, Richa, Ray, Abhik, Khadilkar, Harshad

We propose a generic reward shaping approach for improving the rate of convergence in reinforcement learning (RL), called Self Improvement Based REwards, or SIBRE. The approach is designed for use in conjunction with any existing RL algorithm, and consists of rewarding improvement over the agent's own past performance. We prove that SIBRE converges in expectation under the same conditions as the original RL algorithm. The reshaped rewards help discriminate between policies when the original rewards are weakly discriminated or sparse. Experiments on several well-known benchmark environments with different RL algorithms show that ...

Similar approaches appear to have worked in literature on container loading [27] and railway scheduling [11] problems, without being formally proposed or analysed. One study on bin packing does propose reward shaping explicitly, and is described below. Literature on formal reward shaping: The proposed approach (SIBRE) falls under the category of reward shaping approaches for RL, but with some key novelty points as described below. Prior literature has shown that the optimal policy learnt by RL remains invariant under reward shaping if the modification can be expressed as a potential function [15].
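The idea of rewarding improvement over the agent's own past performance can be sketched as follows. This is a minimal illustration, not the SIBRE algorithm as published: the running baseline, its exponential-moving-average update, the class name, and the step size are all assumptions made for the example.

```python
class SelfImprovementReward:
    """Scores an episode by how much it beats a running performance baseline."""

    def __init__(self, step_size=0.1, initial_baseline=0.0):
        self.step_size = step_size        # assumed baseline learning rate
        self.baseline = initial_baseline  # running estimate of past performance

    def shape(self, episode_return):
        """Return the shaped reward, then move the baseline toward the return."""
        shaped = episode_return - self.baseline
        self.baseline += self.step_size * shaped
        return shaped


# Hypothetical episode returns: two identical episodes, then an improvement.
shaper = SelfImprovementReward(step_size=0.5)
shaped_rewards = [shaper.shape(r) for r in [10.0, 10.0, 14.0]]
# shaped_rewards == [10.0, 5.0, 6.5]; the baseline ends at 10.75
```

In this sketch, repeating the same return earns a shrinking shaped reward while a genuine improvement earns a larger one, which is one way the reshaped signal could separate policies whose raw rewards are close together.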

agent, algorithm, sibre, (16 more...)

arXiv.org Machine Learning

2004.09846

Country:

North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Arizona > Maricopa County > Phoenix (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

arXiv.org Artificial Intelligence, Dec-20-2020

Reinforcement Learning-based Product Delivery Frequency Control

Liu, Yang, Chen, Zhengxing, Virochsiri, Kittipat, Wang, Juan, Wu, Jiahao, Liang, Feng

Frequency control is an important problem in modern recommender systems. It dictates the delivery frequency of recommendations to maintain product quality and efficiency. For example, the frequency of delivering promotional notifications impacts daily metrics as well as infrastructure resource consumption (e.g., CPU and memory usage). There remain open questions on what objective we should optimize to best represent business value in the long term, and how we should balance daily metrics against resource consumption in a dynamically fluctuating environment. We propose a personalized methodology for the frequency control problem, which combines long-term value optimization using reinforcement learning (RL) with a robust volume control technique we term "Effective Factor". We demonstrate statistically significant improvements in daily metrics and resource efficiency with our method in several notification applications at a scale of billions of users. To the best of our knowledge, our study represents the first deep RL application to the frequency control problem at such an industrial scale.

consumption, frequency, resource consumption, (13 more...)

2012.10858

Country: North America > United States > California > San Mateo County > Menlo Park (0.04)

Genre: Research Report (0.64)

Industry:

Marketing (0.68)
Information Technology > Services (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Tao, Lingfeng, Bowman, Michael, Zhang, Jiucai, Zhang, Xiaoli

Forming Human-Robot Cooperation for Tasks with General Goal using Evolutionary Value Learning

arXiv.org Artificial Intelligence, Dec-19-2020

In human-robot cooperation (HRC), the robot cooperates with the human to accomplish the task together. Existing approaches assume the human has a specific goal during the cooperation, and the robot infers and acts toward it. However, in real-world environments, a human usually only has a general goal (e.g., general direction or area in motion planning) at the beginning of the cooperation, which needs to be clarified to a specific goal (e.g., an exact position) during cooperation. The specification process is interactive and dynamic, and depends on the environment and the behavior of the partners. A robot that does not consider the goal specification process may cause frustration to the human partner, lengthen the time to come to an agreement, and compromise team performance or cause it to fail. We present the Evolutionary Value Learning (EVL) approach, which uses a State-based Multivariate Bayesian Inference method to model the dynamics of the goal specification process in HRC, and an Evolutionary Value Updating method to actively enhance the process of goal specification and cooperation formation. This enables the robot to simultaneously help the human to specify the goal and learn a cooperative policy in a reinforcement learning manner. In experiments with real human subjects, the robot equipped with EVL outperforms existing methods with faster goal specification processes and better team performance.

goal specification process, robot, specification process, (14 more...)

2012.10773

Country:

North America > United States > Colorado > Jefferson County > Golden (0.14)
North America > United States > Illinois (0.04)
North America > United States > California > Santa Clara County > Sunnyvale (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.90)
Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

arXiv.org Artificial Intelligence, Dec-19-2020

Model-Based Actor-Critic with Chance Constraint for Stochastic System

Peng, Baiyu, Mu, Yao, Guan, Yang, Li, Shengbo Eben, Yin, Yuming, Chen, Jianyu

Safety constraints are essential for reinforcement learning (RL) applied in real-world situations. Chance constraints are suitable for representing the safety requirements in stochastic systems. Most existing RL methods with chance constraints have a low convergence rate, and only learn a conservative policy. In this paper, we propose a model-based chance constrained actor-critic (CCAC) algorithm which can efficiently learn a safe and non-conservative policy. Unlike existing methods that optimize a conservative lower bound, CCAC directly solves the original chance constrained problem, where the objective function and safe probability are simultaneously optimized with adaptive weights. In order to improve the convergence rate, CCAC utilizes the gradient of the dynamic model to accelerate policy optimization. The effectiveness of CCAC is demonstrated by an aggressive car-following task. Experiments indicate that, compared with previous methods, CCAC improves the performance by 57.6% while guaranteeing safety, with a five times faster convergence rate.

chance constraint, constraint, safe probability, (14 more...)

2012.10716

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.40)

Industry:

Leisure & Entertainment > Games (0.46)
Automobiles & Trucks (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.46)

arXiv.org Artificial Intelligence, Dec-19-2020

Multi-Decoder Attention Model with Embedding Glimpse for Solving Vehicle Routing Problems

Xin, Liang, Song, Wen, Cao, Zhiguang, Zhang, Jie

We present a novel deep reinforcement learning method to learn construction heuristics for vehicle routing problems. Specifically, we propose a Multi-Decoder Attention Model (MDAM) to train multiple diverse policies, which effectively increases the chance of finding good solutions compared with existing methods that train only one policy. A customized beam search strategy is designed to fully exploit the diversity of MDAM. In addition, we propose an Embedding Glimpse layer in MDAM based on the recursive nature of construction, which can improve the quality of each policy by providing more informative embeddings. Extensive experiments on six different routing problems show that our method significantly outperforms the state-of-the-art deep-learning-based models.

decoder, van hoof, welling 2019, (15 more...)

2012.10638

Country:

Asia > Singapore (0.04)
Asia > China (0.04)

Genre: Research Report (1.00)

Industry: Transportation > Freight & Logistics Services (0.71)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)