Goto

Collaborating Authors

 learning policy


Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Neural Information Processing Systems

We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.


Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Neural Information Processing Systems

We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of \tilde{\mathcal{O}}(\sqrt{K}) while allowing an \tilde{\mathcal{O}}(\sqrt{K}) constraint violation in K episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order \tilde{\mathcal{O}}(\sqrt{K}) .


Learning Optimal and Fair Policies for Online Allocation of Scarce Societal Resources from Data Collected in Deployment

Tang, Bill, Koçyiğit, Çağıl, Rice, Eric, Vayanos, Phebe

arXiv.org Artificial Intelligence

We study the problem of allocating scarce societal resources of different types (e.g., permanent housing, deceased donor kidneys for transplantation, ventilators) to heterogeneous allocatees on a waitlist (e.g., people experiencing homelessness, individuals suffering from end-stage renal disease, Covid-19 patients) based on their observed covariates. We leverage administrative data collected in deployment to design an online policy that maximizes expected outcomes while satisfying budget constraints, in the long run. Our proposed policy waitlists each individual for the resource maximizing the difference between their estimated mean treatment outcome and the estimated resource dual-price or, roughly, the opportunity cost of using the resource. Resources are then allocated as they arrive, in a first-come first-serve fashion. We demonstrate that our data-driven policy almost surely asymptotically achieves the expected outcome of the optimal out-of-sample policy under mild technical assumptions. We extend our framework to incorporate various fairness constraints. We evaluate the performance of our approach on the problem of designing policies for allocating scarce housing resources to people experiencing homelessness in Los Angeles based on data from the homeless management information system. In particular, we show that using our policies improves rates of exit from homelessness by 1.9% and that policies that are fair in either allocation or outcomes by race come at a very low price of fairness.


Robotic Handling of Compliant Food Objects by Robust Learning from Demonstration

Misimi, Ekrem, Olofsson, Alexander, Eilertsen, Aleksander, Øye, Elling Ruud, Mathiassen, John Reidar

arXiv.org Artificial Intelligence

The robotic handling of compliant and deformable food raw materials, characterized by high biological variation, complex geometrical 3D shapes, and mechanical structures and texture, is currently in huge demand in the ocean space, agricultural, and food industries. Many tasks in these industries are performed manually by human operators who, due to the laborious and tedious nature of their tasks, exhibit high variability in execution, with variable outcomes. The introduction of robotic automation for most complex processing tasks has been challenging due to current robot learning policies. A more consistent learning policy involving skilled operators is desired. In this paper, we address the problem of robot learning when presented with inconsistent demonstrations. To this end, we propose a robust learning policy based on Learning from Demonstration (LfD) for robotic grasping of food compliant objects. The approach uses a merging of RGB-D images and tactile data in order to estimate the necessary pose of the gripper, gripper finger configuration and forces exerted on the object in order to achieve effective robot handling. During LfD training, the gripper pose, finger configurations and tactile values for the fingers, as well as RGB-D images are saved. We present an LfD learning policy that automatically removes inconsistent demonstrations, and estimates the teacher's intended policy. The performance of our approach is validated and demonstrated for fragile and compliant food objects with complex 3D shapes. The proposed approach has a vast range of potential applications in the aforementioned industry sectors.


Knowing the Past to Predict the Future: Reinforcement Virtual Learning

Zhang, Peng, Huang, Yawen, Hu, Bingzhang, Wang, Shizheng, Duan, Haoran, Moubayed, Noura Al, Zheng, Yefeng, Long, Yang

arXiv.org Artificial Intelligence

Reinforcement Learning (RL)-based control system has received considerable attention in recent decades. However, in many real-world problems, such as Batch Process Control, the environment is uncertain, which requires expensive interaction to acquire the state and reward values. In this paper, we present a cost-efficient framework, such that the RL model can evolve for itself in a Virtual Space using the predictive models with only historical data. The proposed framework enables a step-by-step RL model to predict the future state and select optimal actions for long-sight decisions. The main focuses are summarized as: 1) how to balance the long-sight and short-sight rewards with an optimal strategy; 2) how to make the virtual model interacting with real environment to converge to a final learning policy. Under the experimental settings of Fed-Batch Process, our method consistently outperforms the existing state-of-the-art methods.


What is Hierarchical Reinforcement Learning?

#artificialintelligence

I recently started an AI-focused educational newsletter, that already has over 80,000 subscribers. TheSequence is a no-BS (meaning no hype, no news etc) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers and concepts. Humans have a great ability to reuse prior knowledge when mastering new skills. Just go through the steps you follow to learn a new cooking recipe or a new workout routine.


Deep Reinforcement Learning for Adaptive Learning Systems

Li, Xiao, Xu, Hanchen, Zhang, Jinming, Chang, Hua-hua

arXiv.org Machine Learning

In this paper, we formulate the adaptive learning problem---the problem of how to find an individualized learning plan (called policy) that chooses the most appropriate learning materials based on learner's latent traits---faced in adaptive learning systems as a Markov decision process (MDP). We assume latent traits to be continuous with an unknown transition model. We apply a model-free deep reinforcement learning algorithm---the deep Q-learning algorithm---that can effectively find the optimal learning policy from data on learners' learning process without knowing the actual transition model of the learners' continuous latent traits. To efficiently utilize available data, we also develop a transition model estimator that emulates the learner's learning process using neural networks. The transition model estimator can be used in the deep Q-learning algorithm so that it can more efficiently discover the optimal learning policy for a learner. Numerical simulation studies verify that the proposed algorithm is very efficient in finding a good learning policy, especially with the aid of a transition model estimator, it can find the optimal learning policy after training using a small number of learners.


Understanding AlphaGo

#artificialintelligence

One of required skills as an Artificial Intelligence engineer is ability to understand and explain highly technical research papers in this field. One of my projects as a student in AI Nanodegree classes is an analysis of seminal paper in the field of Game-Playing. The target of my analysis was Nature's paper about technical side of AlphaGo -- Google Deepmind system which for the first time in history beat elite professional Go player, winning by 5 games to 0 with European Go champion -- Fan Hui. The goal of this summary (and my future publications) is to make this knowledge widely understandable, especially for those who are just starting the journey in field of AI or those who doesn't have any experience in this area at all. AlphaGo is narrow AI created by Google DeepMind team to play (and win) board game Go. Before it was presented publicly, the predictions said that according to our state-of-art, we are about 1 decade away from having system with AlphaGo skills (capability to beat a human professional Go player).


Learning Policies For Learning Policies -- Meta Reinforcement Learning (RL²) in Tensorflow

#artificialintelligence

Reinforcement Learning provides a framework for training agents to solve problems in the world. One of the limitations of these agents however is their inflexibility once trained. They are able to learn a policy to solve a specific problem (formalized as an MDP), but that learned policy is often useless in new problems, even relatively similar ones. Imagine the simplest possible agent: one trained to solve a two-armed bandit task in which one arm always provides a positive reward, and the other arm always provides no reward. Using any RL algorithm such as Q-Learning or Policy Gradient, the agent can quickly learn to always choose the arm with the positive reward.


Learning Policies in Partially Observable MDPs with Abstract Actions Using Value Iteration

Janzadeh, Hamed (The University of Texas at Arlington) | Huber, Manfred (The University of Texas at Arlington)

AAAI Conferences

While the use of abstraction and its benefit in terms of transferring learned information to new tasks has been studied extensively and successfully in MDPs, it has not been studied in the context of Partially Observable MDPs. This paper addresses the problem of transferring skills from previous experiences in POMDP models using high-level actions (options). It shows that the optimal value function remains piecewise-linear and convex when policies are high-level actions, and shows how value iteration algorithms can be modified to support options. The results can be applied to all existing value Iteration algorithms. Experiments show how adding options can speed up the learning process.