 Reinforcement Learning


Inverse Policy Evaluation for Value-based Sequential Decision-making

arXiv.org Artificial Intelligence

Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function. Many approaches involve approximate value iteration (e.g., $Q$-learning), and acting greedily with respect to the estimates with an arbitrary degree of entropy to ensure that the state-space is sufficiently explored. Behavior based on explicit greedification assumes that the values reflect those of \textit{some} policy, over which the greedy policy will be an improvement. However, value iteration can produce value functions that do not correspond to \textit{any} policy. This is especially relevant in the function-approximation regime, when the true value function cannot be perfectly represented. In this work, we explore the use of \textit{inverse policy evaluation}, the process of solving for a likely policy given a value function, for deriving behavior from a value function. We provide theoretical and empirical results to show that inverse policy evaluation, combined with an approximate value iteration algorithm, is a feasible method for value-based control.
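
As a rough illustration of the "greedy with some degree of entropy" behavior that the abstract contrasts with, a Boltzmann (softmax) policy over action-value estimates might be sketched as below. The function name `softmax_policy` and the `temperature` parameter are assumptions made for this example and do not come from the paper; inverse policy evaluation itself is not shown.

```python
import math


def softmax_policy(q_values, temperature=1.0):
    """Return action probabilities proportional to exp(Q / temperature).

    Illustrative only: `temperature` is a hypothetical knob controlling entropy.
    Small values approach the greedy policy; large values approach uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

Lowering `temperature` concentrates probability on the highest-valued action, which is the greedification step whose assumptions the abstract questions.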


Auxiliary-task Based Deep Reinforcement Learning for Participant Selection Problem in Mobile Crowdsourcing

arXiv.org Machine Learning

In mobile crowdsourcing (MCS), the platform selects participants to complete location-aware tasks from the recruiters aiming to achieve multiple goals (e.g., profit maximization, energy efficiency, and fairness). However, different MCS systems have different goals and there are possibly conflicting goals even in one MCS system. Therefore, it is crucial to design a participant selection algorithm that applies to different MCS systems to achieve multiple goals. To deal with this issue, we formulate the participant selection problem as a reinforcement learning problem and propose to solve it with a novel method, which we call auxiliary-task based deep reinforcement learning (ADRL). We use transformers to extract representations from the context of the MCS system and a pointer network to deal with the combinatorial optimization problem. To improve the sample efficiency, we adopt an auxiliary-task training process that trains the network to predict the imminent tasks from the recruiters, which facilitates the embedding learning of the deep learning model. Additionally, we release a simulated environment on a specific MCS task, the ride-sharing task, and conduct extensive performance evaluations in this environment. The experimental results demonstrate that ADRL outperforms and improves sample efficiency over other well-recognized baselines in various settings.


t-Soft Update of Target Network for Deep Reinforcement Learning

arXiv.org Machine Learning

This paper proposes a new robust update rule of the target network for deep reinforcement learning, to replace the conventional update rule, given as an exponential moving average. The problem with the conventional rule is that all the parameters are smoothly updated at the same speed, even when some of them are being updated in the wrong direction. To robustly update the parameters, the t-soft update, which is inspired by the Student's t-distribution, is derived with reference to the analogy between the exponential moving average and the normal distribution. In most of the PyBullet robotics simulations, an online actor-critic algorithm with the t-soft update outperformed the conventional methods in terms of the obtained return.
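
For reference, the conventional exponential-moving-average rule that the abstract sets out to replace can be sketched as follows. This is a minimal plain-Python sketch over flat lists of parameters; the name `soft_update` and the default `tau` are assumptions chosen for illustration, and the t-soft update itself is not reproduced here.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Conventional soft target update (exponential moving average).

    Each target parameter moves toward its online counterpart by the same
    fraction `tau`: target <- (1 - tau) * target + tau * online.
    `tau` is an illustrative default, not a value taken from the paper.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [
        (1.0 - tau) * t + tau * o
        for t, o in zip(target_params, online_params)
    ]
```

Because a single `tau` is shared, every parameter is updated at the same rate regardless of how reliable its update is, which is the limitation the abstract describes.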


An FPGA-Based On-Device Reinforcement Learning Approach using Online Sequential Learning

arXiv.org Machine Learning

DQN (Deep Q-Network) is a method to perform Q-learning for reinforcement learning using deep neural networks. DQNs require a large buffer and batch processing for experience replay and rely on backpropagation-based iterative optimization, making them difficult to implement on resource-limited edge devices. In this paper, we propose a lightweight on-device reinforcement learning approach for low-cost FPGA devices. It exploits a recently proposed neural-network based on-device learning approach that does not rely on the backpropagation method but uses an OS-ELM (Online Sequential Extreme Learning Machine) based training algorithm. In addition, we propose a combination of L2 regularization and spectral normalization for the on-device reinforcement learning so that output values of the neural network fit into a certain range and the reinforcement learning becomes stable. The proposed reinforcement learning approach is designed for the Xilinx PYNQ-Z1 board as a low-cost FPGA platform. The evaluation results using OpenAI Gym demonstrate that the proposed algorithm and its FPGA implementation without data transfer overhead complete a CartPole-v0 task 29.76x and 125.88x faster than a conventional DQN-based approach when the number of hidden-layer nodes is 64.


DeepMind's Three Pillars for Building Robust Machine Learning Systems - KDnuggets

#artificialintelligence

I recently started a new newsletter focused on AI education. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Building machine learning systems differs from traditional software development in many aspects of its lifecycle. Established software methodologies for testing, debugging, and troubleshooting are simply impractical when applied to machine learning models.


Dynamic Dispatching for Large-Scale Heterogeneous Fleet via Multi-agent Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Dynamic dispatching is one of the core problems for operation optimization in traditional industries such as mining, as it is about how to smartly allocate the right resources to the right place at the right time. Conventionally, the industry relies on heuristics or even human intuition, which often yield short-sighted and sub-optimal solutions. Leveraging the power of AI and the Internet of Things (IoT), data-driven automation is reshaping this area. However, facing its own challenges such as large-scale and heterogeneous trucks running in a highly dynamic environment, it can barely adopt methods developed in other domains (e.g., ride-sharing). In this paper, we propose a novel Deep Reinforcement Learning approach to solve the dynamic dispatching problem in mining. We first develop an event-based mining simulator with parameters calibrated in real mines. Then we propose an experience-sharing Deep Q Network with a novel abstract state/action representation to learn memories from heterogeneous agents altogether and realize learning in a centralized way. We demonstrate that the proposed methods significantly outperform the most widely adopted approaches in the industry by $5.56\%$ in terms of productivity. The proposed approach has great potential in a broader range of industries (e.g., manufacturing, logistics) which have large-scale heterogeneous equipment working in a highly dynamic environment, as a general framework for dynamic resource allocation.


Improved Memories Learning

arXiv.org Machine Learning

We propose Improved Memories Learning (IMeL), a novel algorithm that turns reinforcement learning (RL) into a supervised learning (SL) problem and delimits the role of neural networks (NN) to interpolation. IMeL consists of two components. The first is a reservoir of experiences. Each experience is updated based on a non-parametric procedural improvement of the policy, computed as a bounded one-sample Monte Carlo estimate. The second is a NN regressor, which receives as input improved experiences from the reservoir (context points) and computes the policy by interpolation. The NN learns to measure the similarity between states in order to compute long-term forecasts by averaging experiences, rather than by encoding the problem structure in the NN parameters. We present preliminary results and propose IMeL as a baseline method for assessing the merits of more complex models and inductive biases.


Deep Reinforcement Learning 2.0

#artificialintelligence

Deep Reinforcement Learning 2.0: the smartest combination of Deep Q-Learning, Policy Gradient, Actor Critic, and DDPG. Created by Hadelin de Ponteves, Kirill Eremenko, and the SuperDataScience Team. Welcome to Deep Reinforcement Learning 2.0! In this course, we will learn and implement a new, incredibly smart AI model called the Twin-Delayed DDPG, which combines state-of-the-art techniques in Artificial Intelligence including continuous Double Deep Q-Learning, Policy Gradient, and Actor Critic. The model is so strong that, for the first time in our courses, we are able to solve the most challenging virtual AI applications (training an ant/spider and a half humanoid to walk and run across a field). To approach this model the right way, we structured the course in three parts. Part 1: Fundamentals. In this part we will study all the fundamentals of Artificial Intelligence which will allow you to understand and master the AI of this course. These include Q-Learning, Deep Q-Learning, Policy Gradient, Actor-Critic and more.


Learning Off-Policy with Online Planning

arXiv.org Artificial Intelligence

We propose Learning Off-Policy with Online Planning (LOOP), combining the techniques from model-based and model-free reinforcement learning algorithms. The agent learns a model of the environment, and then uses trajectory optimization with the learned model to select actions. To sidestep the myopic effect of fixed-horizon trajectory optimization, a value function is attached to the end of the planning horizon. This value function is learned through off-policy reinforcement learning, using trajectory optimization as its behavior policy. Furthermore, we introduce "actor-guided" trajectory optimization to mitigate the actor-divergence issue in the proposed method. We benchmark our method on continuous control tasks and demonstrate that it offers a significant improvement over the underlying model-based and model-free algorithms.
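
The idea of attaching a value function to the end of a fixed planning horizon can be sketched as a scoring function for a candidate action sequence. Everything named here (`plan_score`, `model`, `reward_fn`, `value_fn`, `gamma`) is a hypothetical stand-in for illustration; the abstract does not specify the paper's model, optimizer, or actor-guided variant, and none of those are reproduced.

```python
def plan_score(state, actions, model, reward_fn, value_fn, gamma=0.99):
    """Score an action sequence under a learned model with a terminal value.

    Sums discounted model-predicted rewards over the planning horizon, then
    adds the discounted value estimate of the final predicted state so that
    returns beyond the horizon are not ignored.
    """
    total, discount = 0.0, 1.0
    for action in actions:
        total += discount * reward_fn(state, action)
        state = model(state, action)  # one-step prediction of the next state
        discount *= gamma
    return total + discount * value_fn(state)
```

A trajectory optimizer would compare `plan_score` across candidate sequences and execute the first action of the best one; with an empty action list the score reduces to the value estimate of the current state.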


Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning

arXiv.org Machine Learning

Temporal-Difference (TD) learning with nonlinear smooth function approximation for policy evaluation has achieved great success in modern reinforcement learning. It is shown that such a problem can be reformulated as a stochastic nonconvex-strongly-concave optimization problem, which is challenging as the naive stochastic gradient descent-ascent algorithm suffers from slow convergence. Existing approaches for this problem are based on two-timescale or double-loop stochastic gradient algorithms, which may also require sampling large-batch data. However, in practice, a single-timescale single-loop stochastic algorithm is preferred due to its simplicity and also because its step-size is easier to tune. In this paper, we propose two single-timescale single-loop algorithms which require only one data point each step. Our first algorithm implements momentum updates on both primal and dual variables, achieving an $O(\varepsilon^{-4})$ sample complexity, which shows the important role of momentum in obtaining a single-timescale algorithm. Our second algorithm improves upon the first one by applying variance reduction on top of momentum, which matches the best known $O(\varepsilon^{-3})$ sample complexity in existing works. Furthermore, our variance-reduction algorithm does not require a large-batch checkpoint. Moreover, our theoretical results for both algorithms are expressed in a tighter form of simultaneous primal and dual side convergence.