Tang, Dengwang
Information Compression in Dynamic Games
Tang, Dengwang, Subramanian, Vijay, Teneketzis, Demosthenis
One of the reasons why stochastic dynamic games with an underlying dynamic system are challenging is that strategic players have access to an enormous amount of information, which leads to the use of extremely complex strategies at equilibrium. One approach to resolving this challenge is to simplify players' strategies by identifying appropriate information compression maps, so that the players can make decisions solely based on the compressed version of the information, called the information state. For finite dynamic games with asymmetric information, inspired by the notion of information state for single-agent control problems, we propose two notions of information states, namely mutually sufficient information (MSI) and unilaterally sufficient information (USI). Both these information states are obtained with information compression maps independent of the strategy profile. We show that Bayes-Nash Equilibria (BNE) and Sequential Equilibria (SE) exist when all players use MSI-based strategies. We prove that when all players employ USI-based strategies, the resulting sets of BNE and SE payoff profiles are the same as the sets of BNE and SE payoff profiles obtained when all players use full-information-based strategies. We prove that when all players use USI-based strategies, the resulting set of weak Perfect Bayesian Equilibrium (wPBE) payoff profiles can be a proper subset of all wPBE payoff profiles. We identify MSI and USI in specific models of dynamic games in the literature. We end by presenting an open problem: Do there exist strategy-dependent information compression maps that guarantee the existence of at least one equilibrium or maintain all equilibria that exist under perfect recall? We show, by a counterexample, that a well-known strategy-dependent information compression map used in the literature possesses neither the properties of MSI nor those of USI.
Pure Exploration for Constrained Best Mixed Arm Identification with a Fixed Budget
Tang, Dengwang, Jain, Rahul, Nayyar, Ashutosh, Nuzzo, Pierluigi
In this paper, we introduce the constrained best mixed arm identification (CBMAI) problem with a fixed budget. This is a pure exploration problem in a stochastic finite-armed bandit model. Each arm is associated with a reward and multiple types of costs from unknown distributions. Unlike in the unconstrained best arm identification problem, the optimal solution to the CBMAI problem may be a randomized mixture of multiple arms. The goal thus is to find the best mixed arm that maximizes the expected reward subject to constraints on the expected costs, with a given learning budget $N$. We propose a novel, parameter-free algorithm, called the Score Function-based Successive Reject (SFSR) algorithm, that combines the classical successive reject framework with a novel score-function-based rejection criterion grounded in linear programming theory to identify the optimal support. We provide a theoretical upper bound on the probability of mis-identifying the support of the best mixed arm and show that it decays exponentially in the budget $N$, at a rate governed by constants that characterize the hardness of the problem instance. We also develop an information-theoretic lower bound on the error probability which shows that these constants appropriately characterize the problem difficulty. We validate this empirically on a number of average and hard instances.
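As a rough illustration of why the optimal solution can be a randomized mixture of arms, and of the linear program that underlies the score-function criterion, here is a minimal sketch that solves the mixed-arm LP assuming the reward and cost means are known; the arm values and cost budget are hypothetical, and this is not the SFSR algorithm itself, which must estimate these quantities from samples and prune arms successively.

```python
# Sketch of the linear program defining the best mixed arm when the reward
# and cost means are known (illustration only; not the SFSR algorithm).
import numpy as np
from scipy.optimize import linprog

# Hypothetical example: 4 arms, one cost type, cost budget 0.5.
reward_means = np.array([1.0, 0.8, 0.6, 0.2])
cost_means = np.array([[0.9, 0.6, 0.3, 0.1]])  # shape: (num_cost_types, num_arms)
cost_budgets = np.array([0.5])

num_arms = len(reward_means)
# Variables: mixture weights p over arms.
# maximize  reward_means @ p   <=>   minimize  -reward_means @ p
# subject to  cost_means @ p <= cost_budgets,  sum(p) = 1,  p >= 0
res = linprog(
    c=-reward_means,
    A_ub=cost_means, b_ub=cost_budgets,
    A_eq=np.ones((1, num_arms)), b_eq=np.array([1.0]),
    bounds=[(0.0, 1.0)] * num_arms,
)
print("mixture weights:", np.round(res.x, 3))
print("expected reward:", -res.fun)
# A basic optimal solution is supported on at most (1 + number of cost
# constraints) arms; that support is what SFSR tries to identify from
# bandit feedback.
```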
Information Compression in Dynamic Information Disclosure Games
Tang, Dengwang, Subramanian, Vijay G.
We consider a two-player dynamic information design problem between a principal and a receiver -- a game is played between the two agents on top of a Markovian system controlled by the receiver's actions, where the principal obtains and strategically shares some information about the underlying system with the receiver in order to influence their actions. In our setting, both players have long-term objectives, and the principal commits to their strategies sequentially instead of committing at the beginning. Further, the principal cannot directly observe the system state, but at every turn they can choose randomized experiments to observe the system partially. The principal can share details about the experiments with the receiver. For our analysis we impose the truthful disclosure rule: the principal is required to truthfully announce the details and the result of each experiment to the receiver immediately after the experiment result is revealed. Based on the received information, the receiver takes an action when it is their turn, with the action influencing the state of the underlying system. We show that there exist Perfect Bayesian Equilibria in this game where both agents play Canonical Belief Based (CBB) strategies, using a compressed version of their information, rather than the full information, to choose experiments (for the principal) or actions (for the receiver). We also provide a backward inductive procedure to solve for an equilibrium in CBB strategies.
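As a minimal sketch of the kind of belief computation that CBB strategies are built on, the snippet below applies Bayes' rule after the principal truthfully announces an experiment (a likelihood kernel) and its outcome, and then pushes the belief through the receiver-controlled dynamics; the kernels and sizes are hypothetical, and this is not the paper's equilibrium procedure.

```python
# Minimal sketch of a belief update under truthful disclosure
# (hypothetical kernels; not the backward inductive procedure).
import numpy as np

num_states = 3

def bayes_update(belief, experiment, outcome):
    """Condition the belief on the announced experiment and its outcome.
    experiment[x, y] = P(outcome y | state x) for the chosen experiment."""
    posterior = belief * experiment[:, outcome]
    return posterior / posterior.sum()

def propagate(belief, transition, action):
    """Push the belief through the receiver-controlled Markov dynamics.
    transition[a, x, x'] = P(next state x' | state x, receiver action a)."""
    return belief @ transition[action]

belief = np.ones(num_states) / num_states           # uniform prior
experiment = np.array([[0.8, 0.2],                  # P(y | x) for a binary experiment
                       [0.5, 0.5],
                       [0.1, 0.9]])
transition = np.array([np.eye(num_states)] * 2)     # two receiver actions, both "stay"

belief = bayes_update(belief, experiment, outcome=1)  # principal announces outcome y = 1
belief = propagate(belief, transition, action=0)      # receiver acts based on the belief
print(np.round(belief, 3))
```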
Efficient Online Learning with Offline Datasets for Infinite Horizon MDPs: A Bayesian Approach
Tang, Dengwang, Jain, Rahul, Hao, Botao, Wen, Zheng
In this paper, we study the problem of efficient online reinforcement learning in the infinite horizon setting when there is an offline dataset to start with. We assume that the offline dataset is generated by an expert with an unknown level of competence, i.e., it is not perfect and not necessarily generated using the optimal policy. We show that if the learning agent models the behavioral policy (parameterized by a competence parameter) used by the expert, it can do substantially better in terms of minimizing cumulative regret than if it does not. We establish an upper bound on the regret of the exact informed PSRL algorithm that scales as $\tilde{O}(\sqrt{T})$. This requires a novel prior-dependent regret analysis of Bayesian online learning algorithms for the infinite horizon setting. We then propose an approximate Informed RLSVI algorithm that can be interpreted as first performing imitation learning with the offline dataset and then performing online learning.
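To make "modeling the behavioral policy parameterized by a competence parameter" concrete, here is a minimal sketch under simplifying assumptions: the expert is modeled as a softmax over known Q-values with an inverse-temperature ("competence") parameter, and a grid posterior over that parameter is computed from the offline state-action pairs. The Q-values, dataset, and prior are hypothetical, and the actual algorithm also learns the MDP itself.

```python
# Sketch: Bayesian inference of an expert "competence" parameter from an
# offline dataset, assuming a softmax behavioral policy around known
# Q-values (hypothetical numbers; illustration only).
import numpy as np

Q = np.array([[1.0, 0.2, 0.1],      # Q[s, a], assumed known here for brevity
              [0.3, 0.9, 0.0]])

def expert_policy(beta):
    """Softmax behavioral policy: larger beta (more competent) => closer to argmax Q."""
    logits = beta * Q
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)

# Hypothetical offline demonstrations as (state, action) pairs.
offline_data = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 1)]

beta_grid = np.linspace(0.0, 10.0, 101)              # grid over the competence parameter
log_prior = np.zeros_like(beta_grid)                 # uniform prior on the grid
log_lik = np.array([
    sum(np.log(expert_policy(b)[s, a]) for s, a in offline_data)
    for b in beta_grid
])
log_post = log_prior + log_lik
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()
print("posterior mean competence:", (beta_grid * posterior).sum())
```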
Regret Analysis of the Posterior Sampling-based Learning Algorithm for Episodic POMDPs
Tang, Dengwang, Jain, Rahul, Nayyar, Ashutosh, Nuzzo, Pierluigi
Compared to Markov Decision Processes (MDPs), learning in Partially Observable Markov Decision Processes (POMDPs) can be significantly harder due to the difficulty of interpreting observations. In this paper, we consider episodic learning problems in POMDPs with unknown transition and observation models. We consider the Posterior Sampling-based Reinforcement Learning (PSRL) algorithm for POMDPs and show that its Bayesian regret scales as the square root of the number of episodes. In general, the regret scales exponentially with the horizon length $H$, and we show that this is inevitable by providing a lower bound. However, under the condition that the POMDP is undercomplete and weakly revealing, we establish a polynomial Bayesian regret bound that improves the regret bound by a factor of $\Omega(H^2\sqrt{SA})$ over the recent result of arXiv:2204.08967.
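For context, one common way the "undercomplete, weakly revealing" condition is formalized in the literature is that the observation kernel, viewed as an |O| x |S| matrix, has its S-th singular value bounded away from zero. The sketch below checks this for a hypothetical kernel and threshold; see the paper for the precise assumption used in the regret bound.

```python
# Sketch: checking an "undercomplete, weakly revealing" condition --
# the emission matrix Z[o, s] = P(observation o | state s) should have its
# smallest (S-th) singular value above a threshold (hypothetical values).
import numpy as np

S, O = 3, 4                       # undercomplete: at least as many observations as states
Z = np.array([
    [0.70, 0.10, 0.05],
    [0.20, 0.60, 0.05],
    [0.05, 0.20, 0.30],
    [0.05, 0.10, 0.60],
])
assert np.allclose(Z.sum(axis=0), 1.0)   # each column is a valid observation distribution

alpha = 0.1                       # hypothetical revealing threshold
sigma_min = np.linalg.svd(Z, compute_uv=False)[S - 1]
print(f"S-th singular value = {sigma_min:.3f}, weakly revealing: {sigma_min >= alpha}")
```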
Bridging Imitation and Online Reinforcement Learning: An Optimistic Tale
Hao, Botao, Jain, Rahul, Tang, Dengwang, Wen, Zheng
In this paper, we address the following problem: given an offline demonstration dataset from an imperfect expert, what is the best way to leverage it to bootstrap online learning performance in MDPs? We first propose an Informed Posterior Sampling-based RL (iPSRL) algorithm that uses the offline dataset and information about the expert's behavioral policy used to generate the offline dataset. Its cumulative Bayesian regret goes down to zero exponentially fast in $N$, the offline dataset size, if the expert is competent enough. Since this algorithm is computationally impractical, we then propose the iRLSVI algorithm, which can be seen as a combination of the RLSVI algorithm for online RL and imitation learning. Our empirical results show that the proposed iRLSVI algorithm achieves a significant reduction in regret as compared to two baselines: no offline data, and using the offline dataset without information about the generative policy. Our algorithm is the first to bridge online RL and imitation learning in this way.
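As a minimal sketch of the informed-prior idea (not the full iPSRL algorithm, which also models the expert's behavioral policy), the snippet below seeds the Dirichlet posterior over transitions of a tabular MDP with hypothetical offline triples and then runs a standard posterior-sampling loop with value-iteration planning; all sizes, rewards, and data are made up for illustration.

```python
# Sketch: posterior sampling for a tabular episodic MDP with an "informed"
# prior built from offline data (hypothetical environment and dataset).
import numpy as np

S, A, H = 5, 2, 10
rng = np.random.default_rng(1)
R = rng.random((S, A))                            # reward table, assumed known for brevity
P_true = np.array([[rng.dirichlet(np.ones(S)) for _ in range(A)] for _ in range(S)])  # unknown to the learner

trans_counts = np.ones((S, A, S))                 # uninformative Dirichlet prior
offline_data = [(0, 1, 2), (2, 0, 3), (3, 1, 4)]  # hypothetical expert (s, a, s') triples
for s, a, s_next in offline_data:                 # informed prior: fold in offline transitions
    trans_counts[s, a, s_next] += 1

def plan(P):
    """Finite-horizon value iteration on a sampled MDP; returns a per-step greedy policy."""
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                             # Q[s, a] = R[s, a] + sum_{s'} P[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

for episode in range(20):                         # online phase: the usual posterior-sampling loop
    P_sample = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)] for s in range(S)])
    policy, s = plan(P_sample), 0
    for h in range(H):
        a = policy[h, s]
        s_next = rng.choice(S, p=P_true[s, a])
        trans_counts[s, a, s_next] += 1           # posterior update from online experience
        s = s_next
```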
A Novel Point-based Algorithm for Multi-agent Control Using the Common Information Approach
Tang, Dengwang, Nayyar, Ashutosh, Jain, Rahul
The Common Information (CI) approach provides a systematic way to transform a multi-agent stochastic control problem into a single-agent partially observed Markov decision problem (POMDP) called the coordinator's POMDP. However, such a POMDP can be hard to solve due to its extraordinarily large action space. We propose a new algorithm for multi-agent stochastic control problems, called coordinator's heuristic search value iteration (CHSVI), that combines the CI approach and point-based POMDP algorithms for large action spaces. We demonstrate the algorithm by optimally solving several benchmark problems.
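A toy sketch of where the large action space comes from (not the CHSVI algorithm itself): in the coordinator's POMDP, one coordinator action is a profile of prescriptions, one per agent, each mapping that agent's private information to an action. The snippet below enumerates them for hypothetical agents.

```python
# Toy illustration of the Common Information approach's action-space blow-up
# (hypothetical agents, private information, and action sets).
from itertools import product

private_info = {"agent1": ["p0", "p1"], "agent2": ["q0", "q1", "q2"]}
actions = {"agent1": ["a0", "a1"], "agent2": ["b0", "b1"]}

def prescriptions(agent):
    """All maps from the agent's private information to its actions."""
    infos, acts = private_info[agent], actions[agent]
    return [dict(zip(infos, choice)) for choice in product(acts, repeat=len(infos))]

# A coordinator action is one prescription per agent.
coordinator_actions = list(product(*(prescriptions(ag) for ag in private_info)))
print(len(prescriptions("agent1")), len(prescriptions("agent2")), len(coordinator_actions))
# 2^2 = 4 prescriptions for agent1, 2^3 = 8 for agent2, 32 coordinator actions in total:
# the coordinator's action space grows exponentially in the size of the private information.
```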
Dynamic Games among Teams with Delayed Intra-Team Information Sharing
Tang, Dengwang, Tavafoghi, Hamidreza, Subramanian, Vijay, Nayyar, Ashutosh, Teneketzis, Demosthenis
We analyze a class of stochastic dynamic games among teams with asymmetric information, where members of a team share their observations internally with a delay of $d$. Each team is associated with a controlled Markov Chain, whose dynamics are coupled through the players' actions. These games exhibit challenges in both theory and practice due to the presence of signaling and the increasing domain of information over time. We develop a general approach to characterize a subset of Nash Equilibria where the agents can use a compressed version of their information, instead of the full information, to choose their actions. We identify two subclasses of strategies: Sufficient Private Information Based (SPIB) strategies, which only compress private information, and Compressed Information Based (CIB) strategies, which compress both common and private information. We show that while SPIB-strategy-based equilibria always exist, the same is not true for CIB-strategy-based equilibria. We develop a backward inductive sequential procedure, whose solution (if it exists) provides a CIB-strategy-based equilibrium. We identify some instances where we can guarantee the existence of a solution to the above procedure. Our results highlight the tension among compression of information, existence of (compression-based) equilibria, and backward inductive sequential computation of such equilibria in stochastic dynamic games with asymmetric information.