Group-Fair Online Allocation in Continuous Time
The theory of discrete-time online learning has been successfully applied to many problems that involve sequential decision-making under uncertainty. However, in many applications, including contractual hiring on online freelancing platforms and server allocation in cloud computing systems, the outcome of each action is observed only after a random and action-dependent time. Furthermore, as a consequence of certain ethical and economic concerns, the controller may impose deadlines on the completion of each task and require fairness across different groups in the allocation of the total time budget $B$. To address these applications, we consider a continuous-time online learning problem with fairness considerations, and present a novel framework based on continuous-time utility maximization. We show that this formulation recovers reward-maximizing, max-min fair, and proportionally fair allocation rules across different groups as special cases. We characterize the optimal offline policy, which allocates the total time among different actions in an optimally fair way (as defined by the utility function) and imposes deadlines to maximize time efficiency. In the absence of any statistical knowledge, we propose a novel online learning algorithm based on dual ascent optimization for time averages, and prove that it achieves an $\tilde{O}(B^{-1/2})$ regret bound.
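The special case of proportional fairness mentioned in the abstract can be illustrated with a small dual-ascent sketch. The weights `a`, step size, and iteration count below are illustrative assumptions, not the paper's algorithm; the sketch solves the offline problem max Σ_g a_g log(x_g) subject to Σ_g x_g = B, whose closed-form optimum is x_g = a_g B / Σ a.

```python
import numpy as np

def dual_ascent_fair_allocation(a, B, eta=0.01, iters=5000):
    """Dual-ascent sketch for max sum_g a_g*log(x_g) s.t. sum_g x_g = B.

    `a` are hypothetical per-group weights; the known optimum is
    x_g = a_g * B / sum(a), which the dual iteration should approach.
    """
    a = np.asarray(a, dtype=float)
    lam = 1.0  # dual variable: the "price" of one unit of time
    for _ in range(iters):
        x = a / lam                   # primal maximizer of a_g*log(x) - lam*x
        lam += eta * (x.sum() - B)    # dual (sub)gradient step on the budget
        lam = max(lam, 1e-8)          # keep the price positive
    return a / lam

x = dual_ascent_fair_allocation([1.0, 2.0, 1.0], B=8.0)
# allocations approach [2, 4, 2], i.e. proportional to the weights
```

The group with weight 2 receives twice the time of the others, which is exactly the proportionally fair split of the budget.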
POMDPs in Continuous Time and Discrete Spaces
Many processes, such as discrete event systems in engineering or population dynamics in biology, evolve in discrete space and continuous time. We consider the problem of optimal decision making in such discrete state and action space systems under partial observability. This places our work at the intersection of optimal filtering and optimal control. At the current state of research, a mathematical description for simultaneous decision making and filtering in continuous time with finite state and action spaces is still missing. In this paper, we give a mathematical description of a continuous-time partially observable Markov decision process (POMDP). By leveraging optimal filtering theory, we derive a Hamilton-Jacobi-Bellman (HJB) type equation that characterizes the optimal solution. Using techniques from deep learning, we approximately solve the resulting partial integro-differential equation. We present (i) an approach solving the decision problem offline by learning an approximation of the value function and (ii) an online algorithm which provides a solution in belief space using deep reinforcement learning. We demonstrate applicability on a set of toy examples, which pave the way for future methods providing solutions for high-dimensional problems.
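The filtering half of this setting can be sketched concretely for a finite-state system: between observations the belief follows the master equation dp/dt = pQ for a continuous-time Markov chain with generator Q, and an arriving observation triggers a discrete Bayes correction. The 2-state generator and observation likelihoods below are invented toy values, not from the paper.

```python
import numpy as np

def propagate_belief(p, Q, dt, steps):
    """Euler sketch of the belief/master equation dp/dt = p @ Q for a
    finite-state CTMC with generator Q (rows sum to zero)."""
    for _ in range(steps):
        p = p + dt * (p @ Q)
    return p / p.sum()   # renormalize against accumulated Euler error

def bayes_update(p, likelihood):
    """Discrete Bayes correction when a noisy observation arrives."""
    post = p * likelihood
    return post / post.sum()

Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])        # hypothetical 2-state generator
p = np.array([1.0, 0.0])            # start certain in state 0
p = propagate_belief(p, Q, dt=0.001, steps=2000)   # predict up to t = 2
p = bayes_update(p, np.array([0.9, 0.2]))          # observation favoring state 0
```

A controller in this setting would then act on the belief `p` rather than on the (unobserved) state, which is what moves the HJB equation into belief space.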
Nash Equilibrium and Belief Evolution in Differential Games
Zhou, Jiangjing, Petrosian, Ovanes, Zhang, Ye, Gao, Hongwei
Differential games [4, 6] involve multiple players controlling a dynamical system through their actions, which are described by differential state equations. These games evolve over a continuous-time horizon, where each player seeks to optimize an objective function that depends on the system's state, their own actions, and potentially the actions of others. In this study, we extend the classic differential game model to scenarios involving motion-payoff uncertainty, where players face uncertainties in both the dynamic equations and the payoff functions, and are unaware of certain parameters in the environment or in their opponents' payoff structures. In dynamic games, optimal control techniques are generalized to accommodate multiple players with both shared and conflicting interests. As shown in [9], if a set of interconnected partial differential equations, commonly referred to as the Hamilton-Jacobi-Bellman (HJB) equations, has solutions, then a Nash equilibrium can be achieved. At this equilibrium, no player can improve their outcome by unilaterally changing their strategy. However, traditional dynamic game models often assume that all players possess complete knowledge of the game. In many real-world scenarios, players face rapidly changing and uncertain environments, leading to incomplete information about the system's dynamics and payoffs [22, 3, 15, 1]. To address this uncertainty, we apply Bayesian updating methods, where players update their beliefs about unknown parameters as new information becomes available.
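The Bayesian updating step the authors describe can be sketched over a finite set of parameter hypotheses: the posterior over candidate parameters is proportional to the prior times the likelihood of each new observation. The two candidate payoff parameters and their likelihood values below are made-up illustrations.

```python
import numpy as np

def bayes_belief_update(prior, likelihoods):
    """One Bayesian update over a finite set of parameter hypotheses
    theta_i: posterior ∝ prior * P(observation | theta_i)."""
    post = prior * likelihoods
    return post / post.sum()

# Hypothetical example: a player weighing two candidate values of an
# unknown parameter in the opponent's payoff function.
belief = np.array([0.5, 0.5])
for obs_lik in [np.array([0.8, 0.3]),   # likelihood of 1st observed action
                np.array([0.7, 0.4])]:  # likelihood of 2nd observed action
    belief = bayes_belief_update(belief, obs_lik)
```

After two observations that are both more likely under the first hypothesis, the belief concentrates on it; in the game, each player's strategy is then conditioned on this evolving belief.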
ZORMS-LfD: Learning from Demonstrations with Zeroth-Order Random Matrix Search
Dry, Olivia, Molloy, Timothy L., Jin, Wanxin, Shames, Iman
We propose Zeroth-Order Random Matrix Search for Learning from Demonstrations (ZORMS-LfD). ZORMS-LfD enables the costs, constraints, and dynamics of constrained optimal control problems, in both continuous and discrete time, to be learned from expert demonstrations without requiring smoothness of the learning-loss landscape. In contrast, existing state-of-the-art first-order methods require the existence and computation of gradients of the costs, constraints, dynamics, and learning loss with respect to states, controls and/or parameters. Most existing methods are also tailored to discrete time, with constrained problems in continuous time receiving only cursory attention. We demonstrate that ZORMS-LfD matches or surpasses the performance of state-of-the-art methods in terms of both learning loss and compute time across a variety of benchmark problems. On unconstrained continuous-time benchmark problems, ZORMS-LfD achieves similar loss performance to state-of-the-art first-order methods with an over $80$\% reduction in compute time. On constrained continuous-time benchmark problems where there is no specialized state-of-the-art method, ZORMS-LfD is shown to outperform the commonly used gradient-free Nelder-Mead optimization method.
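The flavor of a zeroth-order random matrix search can be conveyed with a minimal sketch: perturb the parameter matrix along a random direction, estimate a directional derivative from two loss evaluations, and step against it. This is a generic random-search step under assumed step sizes, not the authors' exact ZORMS-LfD update, and the quadratic toy loss stands in for the (possibly non-smooth) learning loss.

```python
import numpy as np

def zo_random_matrix_step(theta, loss, delta=1e-2, step=1e-1, rng=None):
    """One zeroth-order step: sample a random perturbation matrix U,
    form a two-point finite-difference estimate of the directional
    derivative of `loss` along U, and move against it. Only loss
    evaluations are needed; no gradients of costs or dynamics."""
    rng = np.random.default_rng() if rng is None else rng
    U = rng.standard_normal(theta.shape)
    U /= np.linalg.norm(U)                       # unit Frobenius norm
    g = (loss(theta + delta * U) - loss(theta - delta * U)) / (2 * delta)
    return theta - step * g * U

# Toy matrix-valued loss, just to exercise the step.
loss = lambda M: float(np.sum((M - 1.0) ** 2))
theta = np.zeros((2, 2))
rng = np.random.default_rng(0)
for _ in range(2000):
    theta = zo_random_matrix_step(theta, loss, rng=rng)
```

Because only `loss` evaluations are used, the same loop applies unchanged whether the underlying optimal control problem is in discrete or continuous time, which is the property the abstract emphasizes.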
Time After Time: Deep-Q Effect Estimation for Interventions on When and What to do
Wald, Yoav, Goldstein, Mark, Efroni, Yonathan, van Amsterdam, Wouter A. C., Ranganath, Rajesh
Problems in fields such as healthcare, robotics, and finance require reasoning about the value both of what decision or action to take and when to take it. The prevailing hope is that artificial intelligence will support such decisions by estimating the causal effect of policies such as how to treat patients or how to allocate resources over time. However, existing methods for estimating the effect of a policy struggle with \emph{irregular time}. They either discretize time, or disregard the effect of timing policies. We present a new deep-Q algorithm, Earliest Disagreement Q-Evaluation (EDQ), that estimates the effect of both when and what to do. EDQ makes use of a recursion for the Q-function that is compatible with flexible sequence models, such as transformers. EDQ provides accurate estimates under standard assumptions. We validate the approach through experiments on survival time and tumor growth tasks.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
This paper presents a hybrid approach for using both crowdsourced labels and an incrementally (online) trained model to address prediction problems; the core idea is to lean heavily on the crowd as the system is ramping up, learn from the labels thus acquired, and then use the crowd less and less often as the model becomes more confident. This is done via a sophisticated framing of the problem as a stochastic game based on a CRF prediction model in which the system and the crowd are both players. The system can issue one or more queries q for tokens x (with true label y) which elicit responses r, where there is a utility U(q,r) for each outcome; the system thus attempts to pick the actions that will maximize the expected utility. Furthermore, the queries are not issued all at once, but at times s (with response times t); utility is maximized with respect to a deadline t_deadline by which an answer needs to be computed (this thus determines how many queries are sent out, at what rate, etc.). Computing this expected utility requires using the simulation dynamics model P(y, r, t | x, q, s) in order to compute the utilities as in (4). Given the utility values, the optimal action could be chosen; however, the introduction of continuous time makes this intractable to optimize, and as such an approximation based on Monte Carlo Tree Search and TD learning is used (Algorithm 1).
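The expected-utility computation the review describes can be sketched by Monte Carlo: sample (response, response-time) outcomes from a simulation model and average the utility, zeroing out answers that miss the deadline. All names, the toy response model, and the utility values below are illustrative assumptions, not the paper's API.

```python
import random

DEADLINE = 1.0   # hypothetical t_deadline in the same time units as t

def expected_utility(action, simulate, utility, n=2000, rng=None):
    """Monte Carlo sketch of E[U(q, r)] under a simulation model of
    responses r and response times t; late responses earn nothing."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n):
        response, t = simulate(action, rng)
        total += utility(action, response) if t <= DEADLINE else 0.0
    return total / n

# Toy model: correct response w.p. 0.7, exponential response times.
simulate = lambda a, rng: (rng.random() < 0.7, rng.expovariate(2.0))
utility = lambda a, r: 1.0 if r else -0.5
ev = expected_utility("query", simulate, utility)
```

Evaluating this estimate for every candidate query schedule is what becomes intractable in continuous time, motivating the paper's MCTS-plus-TD approximation.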