Goto

Collaborating Authors

 infinite-horizon


Constructing Dynamic Treatment Regimes in Infinite-Horizon Settings

arXiv.org Machine Learning

The application of existing methods for constructing optimal dynamic treatment regimes is limited to cases where investigators are interested in optimizing a utility function over a fixed period of time (finite horizon). In this manuscript, we develop an inferential procedure based on temporal difference residuals for optimal dynamic treatment regimes in infinite-horizon settings, where there is no a priori fixed end of follow-up point. The proposed method can be used to determine the optimal regime in chronic diseases where patients are monitored and treated throughout their life. We derive large sample results necessary for conducting inference. We also simulate a cohort of patients with diabetes to mimic the third wave of the National Health and Nutrition Examination Survey, and we examine the performance of the proposed method in controlling the level of hemoglobin A1c. Supplementary materials for this article are available online.


Computing Solutions in Infinite-Horizon Discounted Adversarial Patrolling Games

AAAI Conferences

Stackelberg games form the core of a number of tools deployed for computing optimal patrolling strategies in adversarial domains, such as the US Federal Air Marshall Service and the US Coast Guard. In traditional Stackelberg security game models the attacker knows only the probability that each target is covered by the defender, but is oblivious to the detailed timing of the coverage schedule. In many real-world situations, however, the attacker can observe the current location of the defender and can exploit this knowledge to reason about the defender’s future moves. We show that this general modeling framework can be captured using adversarial patrolling games (APGs) in which the defender sequentially moves between targets, with moves constrained by a graph, while the attacker can observe the defender’s current location and his (stochastic) policy concerning future moves. We offer a very general model of infinite-horizon discounted adversarial patrolling games. Our first contribution is to show that defender policies that condition only on the previous defense move (i.e., Markov stationary policies) can be arbitrarily suboptimal for general APGs. We then offer a mixed-integer non-linear programming (MINLP) formulation for computing optimal randomized policies for the defender that can condition on history of bounded, but arbitrary, length, as well as a mixed-integer linear programming (MILP) formulation to approximate these, with provable quality guarantees. Additionally, we present a non-linear programming (NLP) formulation for solving zero-sum APGs. We show experimentally that MILP significantly outperforms the MINLP formulation, and is, in turn, significantly outperformed by the NLP specialized to zero-sum games.


On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision Processes

arXiv.org Artificial Intelligence

We consider infinite-horizon $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. We consider the algorithm Value Iteration and the sequence of policies $\pi_1,...,\pi_k$ it implicitely generates until some iteration $k$. We provide performance bounds for non-stationary policies involving the last $m$ generated policies that reduce the state-of-the-art bound for the last stationary policy $\pi_k$ by a factor $\frac{1-\gamma}{1-\gamma^m}$. In particular, the use of non-stationary policies allows to reduce the usual asymptotic performance bounds of Value Iteration with errors bounded by $\epsilon$ at each iteration from $\frac{\gamma}{(1-\gamma)^2}\epsilon$ to $\frac{\gamma}{1-\gamma}\epsilon$, which is significant in the usual situation when $\gamma$ is close to 1. Given Bellman operators that can only be computed with some error $\epsilon$, a surprising consequence of this result is that the problem of "computing an approximately optimal non-stationary policy" is much simpler than that of "computing an approximately optimal stationary policy", and even slightly simpler than that of "approximately computing the value of some fixed policy", since this last problem only has a guarantee of $\frac{1}{1-\gamma}\epsilon$.