We consider the problem of scheduling an unknown sequence of tasks for a single server as the tasks arrive with the goal off maximizing the total weighted value of the tasks served before their deadline is reached. This problem is faced for example by schedulers in packet communication networks when packets have deadlines and rewards associated with them. We make the simplifying assumptions that every task takes the same fixed amount of time to serve, that every task arrives with the same initial latency to its deadline. We also assume that future task arrivals are stochastically described by a Hidden Markov Model (HMM). The resulting decision problem can be formally modelled as a Partially Observable Markov Decision Process (POMDP). We first present and analyze a new optimal off-line scheduling algorithm called Prescient Minloss scheduling for the problem just described, but with "prescient" foreknowledge of the future task arrivals. We then discuss heuristically adapting this off-line algorithm into an online algorithm by sampling possible future task sequences from the HMM. We discuss and empirically compare scheduling methods for this online problem, including previously proposed samplingbased POMDP solution methods. Our heuristic approach can be used to adapt any off-line scheduler into an online scheduler.
Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning -- in the form of a policy-gradient method -- to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach isto construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use.
Sequential decision-making problems with multiple objectives arise naturally in practice and pose unique challenges for research in decision-theoretic planning and learning, which has largely focused on single-objective settings. This article surveys algorithms designed for sequential decision-making problems with multiple objectives. Though there is a growing body of literature on this subject, little of it makes explicit under what circumstances special methods are needed to solve multi-objective problems. Therefore, we identify three distinct scenarios in which converting such a problem to a single-objective one is impossible, infeasible, or undesirable. Furthermore, we propose a taxonomy that classifies multi-objective methods according to the applicable scenario, the nature of the scalarization function (which projects multi-objective values to scalar ones), and the type of policies considered. We show how these factors determine the nature of an optimal solution, which can be a single policy, a convex hull, or a Pareto front. Using this taxonomy, we survey the literature on multi-objective methods for planning and learning. Finally, we discuss key applications of such methods and outline opportunities for future work.
Goal-directed Markov Decision Process models (GDMDPs) are good models for many decision-theoretic planning tasks. They have been used in conjunction with two different reward structures, namely the goal-reward representation and the action-penalty representation. We apply GDMDPs to planning tasks in the presence of traps such as steep slopes for outdoor robots or staircases for indoor robots, and study the differences between the two reward structures. In these situations, achieving the goal is often the primary objective while minimizing the travel time is only of secondary importance. We show that the action-penalty representation without discounting guarantees that the optimal plan achieves the goal for sure (if this is possible) but neither the actionpenalty representation with discounting nor the goal-reward representation with discounting have this property. We then show exactly when this trapping phenomenon occurs, using a novel interpretation for discounting, namely that it models agents that useconvexexponentialutility functions and thus are optimistic in the face of uncertainty. Finally, we show how the trapping phenomenon can be eliminated with our Selective State-Deletion Method.
An important problem for the Internet is how to provide a guaranteed quality of service to users, in contrast to the current "best-effort" service. A key aspect of this problem is how routers should share network capacity between different classes of traffic. This decision needs to be made for each incoming packet, and is known as the packet scheduling problem. A major challenge in packet scheduling is that the behaviour of each traffic class may not be known in advance, and can vary dynamically. In this paper, we describe how we have modelled the packet scheduling problem as an application for reinforcement learning (RL). We demonstrate how our RL approach can learn scheduling policies that satisfy the quality of service requirements of multiple traffic classes under a variety of conditions. We also present an insight into the effectiveness of two different RL algorithms in this context. A major benefit of this approach is that we can help network providers deliver a guaranteed quality of service to customers without manual fine-tuning of the network routers.