Nguyen, Thanh Hong
ComaDICE: Offline Cooperative Multi-Agent Reinforcement Learning with Stationary Distribution Shift Regularization
Bui, The Viet, Nguyen, Thanh Hong, Mai, Tien
Offline reinforcement learning (RL) has garnered significant attention for its ability to learn effective policies from pre-collected datasets without the need for further environmental interactions. While promising results have been demonstrated in single-agent settings, offline multi-agent reinforcement learning (MARL) presents additional challenges due to the large joint state-action space and the complexity of multi-agent behaviors. A key issue in offline RL is the distributional shift, which arises when the target policy being optimized deviates from the behavior policy that generated the data. This problem is exacerbated in MARL due to the interdependence between agents' local policies and the expansive joint state-action space. Prior approaches have primarily addressed this challenge by incorporating regularization in the space of either Q-functions or policies. In this work, we introduce a regularizer in the space of stationary distributions to better handle distributional shift. Our algorithm, ComaDICE, offers a principled framework for offline cooperative MARL by incorporating stationary distribution regularization for the global learning policy, complemented by a carefully structured multi-agent value decomposition strategy to facilitate multi-agent training. Through extensive experiments on the multi-agent MuJoCo and StarCraft II benchmarks, we demonstrate that ComaDICE achieves superior performance compared to state-of-the-art offline MARL methods across nearly all tasks.
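To make the idea of stationary-distribution regularization concrete, here is a minimal, illustrative sketch of a DICE-style offline objective in PyTorch. The network name `NuNet`, the chi-square-like convex conjugate, and the coefficient `alpha` are assumptions for illustration; ComaDICE's actual formulation additionally decomposes the global value function across agents via a mixing network, which is not shown here.

```python
import torch
import torch.nn as nn

class NuNet(nn.Module):
    """Lagrangian value network nu(s) used in DICE-style objectives (illustrative)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def dice_loss(nu_net, s0, s, r, s_next, done, gamma=0.99, alpha=1.0):
    """Sketch of a stationary-distribution-regularized objective:
    (1 - gamma) * E[nu(s0)] + alpha * E_D[ f*(e_nu(s, a) / alpha) ],
    where e_nu is the Bellman residual and f* is the convex conjugate of the
    chosen f-divergence (a chi-square-like form is used here for illustration)."""
    e = r + gamma * (1.0 - done) * nu_net(s_next) - nu_net(s)   # Bellman residual e_nu
    y = e / alpha
    f_star = torch.clamp(y + 0.5 * y ** 2, min=0.0)             # illustrative conjugate; exact form depends on f
    return (1.0 - gamma) * nu_net(s0).mean() + alpha * f_star.mean()
```

Minimizing such a loss over `nu_net` on the offline dataset is the inner problem of the regularized objective; the induced stationary-distribution ratios can then be used to extract a policy, which is the general DICE recipe this abstract builds on.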
Generative Modelling of Stochastic Actions with Arbitrary Constraints in Reinforcement Learning
Chen, Changyu, Karunasena, Ramesha, Nguyen, Thanh Hong, Sinha, Arunesh, Varakantham, Pradeep
Many problems in Reinforcement Learning (RL) seek an optimal policy in large, discrete, multidimensional yet unordered action spaces; these include problems in randomized allocation of resources, such as placements of multiple security resources and emergency response units. A challenge in this setting is that the underlying action space is categorical (discrete and unordered) and large, for which existing RL methods do not perform well. Moreover, these problems require validity of the realized action (allocation); this validity constraint is often difficult to express compactly in closed mathematical form. The allocation nature of the problem also favors stochastic optimal policies, when they exist. In this work, we address these challenges by (1) applying a (state) conditional normalizing flow to compactly represent the stochastic policy -- the compactness arises because the network produces only one sampled action and the corresponding log probability of the action, which is then used by an actor-critic method; and (2) employing an invalid action rejection method (via a valid action oracle) to update the base policy. The action rejection is enabled by a modified policy gradient that we derive. Finally, we conduct extensive experiments to show the scalability of our approach compared to prior methods and its ability to enforce arbitrary state-conditional constraints on the support of the action distribution in any state.
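The invalid-action-rejection idea can be sketched independently of the flow architecture: sample from the state-conditional policy, query the validity oracle, and resample on rejection. The function names `policy` and `is_valid` and the retry budget are placeholders; the modified policy gradient that corrects for the rejection step is derived in the paper and not reproduced here.

```python
def sample_valid_action(policy, state, is_valid, max_tries=1000):
    """Rejection-sample until the validity oracle accepts (illustrative sketch).

    `policy(state)` is assumed to return (action, log_prob) from a
    state-conditional generative model such as a normalizing flow;
    `is_valid(state, action)` is the valid-action oracle.
    """
    for _ in range(max_tries):
        action, log_prob = policy(state)
        if is_valid(state, action):
            return action, log_prob
    raise RuntimeError("no valid action sampled within the retry budget")
```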
Inverse Factorized Q-Learning for Cooperative Multi-agent Imitation Learning
Bui, The Viet, Mai, Tien, Nguyen, Thanh Hong
This paper concerns imitation learning (IL), i.e., the problem of learning to mimic expert behaviors from demonstrations, in cooperative multi-agent systems. The learning problem under consideration poses several challenges, characterized by high-dimensional state and action spaces and intricate inter-agent dependencies. In the single-agent setting, IL can be performed efficiently via an inverse soft-Q learning process given expert demonstrations. However, extending this framework to a multi-agent context introduces the need to simultaneously learn both local value functions, to capture local observations and individual actions, and a joint value function for exploiting centralized learning. In this work, we introduce a novel multi-agent IL algorithm designed to address these challenges. Our approach enables centralized learning by leveraging mixing networks to aggregate decentralized Q functions. A main advantage of this approach is that the weights of the mixing networks can be trained using information derived from global states. We further establish conditions on the mixing networks under which the multi-agent objective function exhibits convexity within the Q function space. We present extensive experiments conducted on several challenging competitive and cooperative multi-agent game environments, including an advanced version of the StarCraft multi-agent challenge (i.e., SMACv2), which demonstrate the effectiveness of our proposed algorithm compared to existing state-of-the-art multi-agent IL algorithms.
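A common way to realize "mixing networks whose weights are generated from the global state" is a QMIX-style hypernetwork mixer; the sketch below is one such illustrative construction. The layer sizes and the use of abs() to keep the generated weights non-negative are assumptions, chosen so the mixer is monotone in each local Q-value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingNetwork(nn.Module):
    """Sketch of a QMIX-style mixer: per-agent Q-values are combined with
    weights produced by hypernetworks conditioned on the global state."""
    def __init__(self, n_agents, state_dim, embed=32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.w1 = nn.Linear(state_dim, n_agents * embed)
        self.b1 = nn.Linear(state_dim, embed)
        self.w2 = nn.Linear(state_dim, embed)
        self.b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):                 # agent_qs: [B, n_agents]
        B = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(B, self.n_agents, self.embed)
        h = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + self.b1(state).unsqueeze(1))
        w2 = torch.abs(self.w2(state)).view(B, self.embed, 1)
        return (torch.bmm(h, w2) + self.b2(state).unsqueeze(1)).view(B)
```

Keeping the state-generated weights non-negative is one simple way to make the mixed value monotone in each local Q-value; the conditions the paper establishes for convexity of the objective in the Q-function space are more general than this sketch.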
Mimicking To Dominate: Imitation Learning Strategies for Success in Multiagent Competitive Games
Bui, The Viet, Mai, Tien, Nguyen, Thanh Hong
Training agents in multi-agent competitive games presents significant challenges due to their intricate nature. These challenges are exacerbated by dynamics influenced not only by the environment but also by opponents' strategies. Existing methods often struggle with slow convergence and instability. To address this, we harness the potential of imitation learning to comprehend and anticipate opponents' behavior, aiming to mitigate uncertainty about the game dynamics. Our key contributions include: (i) a new multi-agent imitation learning model for predicting the opponents' next moves -- our model works with hidden opponent actions and local observations; (ii) a new multi-agent reinforcement learning algorithm that combines our imitation learning model and policy training into a single training process; and (iii) extensive experiments in three challenging game environments, including an advanced version of the StarCraft multi-agent challenge (i.e., SMACv2). Experimental results show that our approach achieves superior performance compared to existing state-of-the-art multi-agent RL algorithms.
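One way to read "combining the imitation learning model and policy training into a single training process" is a joint loss that adds an opponent-prediction term to the usual RL loss; the sketch below only illustrates that shape. `actor_critic.loss`, `opponent_model`, the batch fields, and the weight `lambda_il` are placeholders, and since the paper treats opponents' true actions as hidden, the directly supervised target used here is purely a stand-in for its actual imitation objective.

```python
import torch.nn.functional as F

def joint_update(batch, actor_critic, opponent_model, optimizer, lambda_il=0.1):
    """One illustrative training step: RL loss + weighted opponent-imitation loss."""
    rl_loss = actor_critic.loss(batch)                         # e.g., an actor-critic/PPO loss
    logits = opponent_model(batch.obs)                         # predicted next opponent moves
    il_loss = F.cross_entropy(logits, batch.opponent_actions)  # placeholder imitation term
    loss = rl_loss + lambda_il * il_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```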
CounterNet: End-to-End Training of Prediction Aware Counterfactual Explanations
Guo, Hangzhi, Nguyen, Thanh Hong, Yadav, Amulya
This work presents CounterNet, a novel end-to-end learning framework which integrates Machine Learning (ML) model training and the generation of corresponding counterfactual (CF) explanations into a single pipeline. Counterfactual explanations offer a contrastive case, i.e., they attempt to find the smallest modification to the feature values of an instance that changes the prediction of the ML model on that instance to a predefined output. Prior techniques for generating CF explanations suffer from two major limitations: (i) all of them are post-hoc methods designed for use with proprietary ML models -- as a result, their procedure for generating CF explanations is uninformed by the training of the ML model, which leads to misalignment between model predictions and explanations; and (ii) most of them rely on solving separate, time-intensive optimization problems to find CF explanations for each input data point, which negatively impacts their runtime. In a departure from this prevalent post-hoc paradigm, CounterNet optimizes the generation of CF explanations jointly with the training of the predictive model, only once. We adopt a block-wise coordinate descent procedure that helps train CounterNet's network effectively. Our extensive experiments on multiple real-world datasets show that CounterNet generates high-quality predictions, consistently achieves 100% CF validity and low proximity scores (thereby achieving a well-balanced cost-invalidity trade-off) for any new input instance, and runs 3x faster than existing state-of-the-art baselines.
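A minimal sketch of the end-to-end, block-wise coordinate descent idea follows: one shared encoder feeds both a prediction head and a CF-generator head, and each training step alternates a gradient step on the prediction loss with a step on CF validity and proximity losses. The architecture, loss forms, and two-optimizer setup are illustrative assumptions, not CounterNet's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CounterNetSketch(nn.Module):
    """Shared encoder with a prediction head and a counterfactual-generator head."""
    def __init__(self, d_in, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU())
        self.predictor = nn.Linear(hidden, 1)       # prediction logit
        self.cf_head = nn.Linear(hidden, d_in)      # residual counterfactual

    def forward(self, x):
        z = self.encoder(x)
        return self.predictor(z).squeeze(-1), x + self.cf_head(z)

def train_step(model, x, y, opt_pred, opt_cf):
    """Block-wise coordinate descent: alternate the prediction and CF objectives.
    `opt_pred` and `opt_cf` are assumed to hold the corresponding parameter blocks."""
    # Block 1: prediction loss on (x, y)
    logit, _ = model(x)
    pred_loss = F.binary_cross_entropy_with_logits(logit, y)
    opt_pred.zero_grad(); pred_loss.backward(); opt_pred.step()

    # Block 2: CF validity (flip the predicted label) + proximity (stay near x)
    logit, x_cf = model(x)
    flipped = 1.0 - torch.sigmoid(logit).round().detach()
    logit_cf, _ = model(x_cf)
    cf_loss = F.binary_cross_entropy_with_logits(logit_cf, flipped) + (x_cf - x).pow(2).mean()
    opt_cf.zero_grad(); cf_loss.backward(); opt_cf.step()
    return pred_loss.item(), cf_loss.item()
```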
Conquering Adversary Behavioral Uncertainty in Security Games: An Efficient Modeling Robust Based Algorithm
Nguyen, Thanh Hong (University of Southern California) | Sinha, Arunesh (University of Southern California) | Tambe, Milind (University of Southern California)
Stackelberg Security Games (SSGs) have been widely applied to solving real-world security problems, with a significant research emphasis on modeling attackers' behaviors to handle their bounded rationality. However, access to real-world data (used for learning an accurate behavioral model) is often limited, leading to uncertainty about attackers' behaviors during modeling. This paper therefore focuses on addressing behavioral uncertainty in SSGs with the following main contributions: 1) we present a new uncertainty game model that integrates uncertainty intervals into a behavioral model to capture behavioral uncertainty; and 2) based on this game model, we propose a novel robust algorithm that approximately computes the defender's optimal strategy in the worst-case scenario of uncertainty, with a bound guarantee on its solution quality.
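To illustrate the worst-case reasoning over behavioral uncertainty, the sketch below evaluates a defender mixed strategy against a quantal-responding attacker whose rationality parameter is only known to lie in an interval, taking the minimum over a grid. The quantal-response stand-in, the grid approximation, and the variable names are assumptions for illustration; the paper places intervals on the parameters of its behavioral model and gives an algorithm with a solution-quality bound rather than a grid search.

```python
import numpy as np

def defender_eu_qr(x, R_a, P_a, R_d, P_d, lam):
    """Defender expected utility when the attacker quantal-responds (parameter lam)
    to the coverage vector x; all payoff arrays are per target."""
    attacker_u = (1.0 - x) * R_a + x * P_a
    q = np.exp(lam * (attacker_u - attacker_u.max()))   # numerically stable softmax
    q /= q.sum()
    return float(np.sum(q * (x * R_d + (1.0 - x) * P_d)))

def worst_case_eu(x, R_a, P_a, R_d, P_d, lam_interval, grid=50):
    """Worst-case utility over an uncertainty interval on lam (grid approximation)."""
    lo, hi = lam_interval
    return min(defender_eu_qr(x, R_a, P_a, R_d, P_d, lam)
               for lam in np.linspace(lo, hi, grid))
```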
Protecting Wildlife under Imperfect Observation
Nguyen, Thanh Hong (University of Southern California) | Sinha, Arunesh (University of Southern California) | Gholami, Shahrzad (University of Southern California) | Plumptre, Andrew (Wildlife Conservation Society) | Joppa, Lucas (Microsoft Research) | Tambe, Milind (University of Southern California) | Driciru, Margaret (Uganda Wildlife Authority) | Wanyama, Fred (Uganda Wildlife Authority) | Rwetsiba, Aggrey (Uganda Wildlife Authority) | Critchlow, Rob (The University of York) | Beale, Colin (The University of York)
Wildlife poaching presents a serious extinction threat to many animal species. In order to save wildlife in designated wildlife parks, park rangers conduct patrols over the park area to combat such illegal activities. An important aspect of the rangers' patrolling activity is to anticipate where poachers are likely to catch animals and then respond accordingly. Previous work has applied defender-attacker Stackelberg Security Games (SSGs) to the problem of wildlife protection, wherein attacker behavioral models are used to predict the behaviors of the poachers. However, these behavioral models have several limitations that reduce their accuracy in predicting poachers' behavior. First, existing models fail to account for the rangers' imperfect observations w.r.t. poaching activities (due to the rangers' limited capability to patrol thoroughly over a vast geographical area). Second, these models are built upon discrete choice models that assume a single agent choosing targets, while it is infeasible to obtain information about every single attacker in wildlife protection. Third, these models do not consider the effect of past poachers' actions on current poachers' activities, one of the key factors affecting the poachers' behaviors. In this work, we attempt to address these limitations while providing three main contributions. First, we propose a novel hierarchical behavioral model, HiBRID, to predict the poachers' behaviors wherein the rangers' imperfect detection of poaching signs is taken into account, a significant advance over existing behavioral models in security games. Furthermore, HiBRID incorporates the temporal effect on the poachers' behaviors. The model also does not require a known number of attackers. Second, we provide two new heuristics, parameter separation and target abstraction, to reduce the computational complexity of learning the model parameters. Finally, we use real-world data collected in Queen Elizabeth National Park (QENP) in Uganda over 12 years to evaluate the prediction accuracy of our new model.
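The imperfect-observation idea (a poaching sign enters the data only if poaching occurred and rangers detected it) can be written as a simple per-cell likelihood, as in the toy sketch below. This is an illustrative stand-in, not HiBRID's hierarchical likelihood; the array names and the independence assumptions are mine.

```python
import numpy as np

def sign_likelihood(attack_prob, detect_prob, observed_sign):
    """Toy imperfect-observation likelihood per grid cell: a sign is recorded
    only if an attack occurred AND rangers detected it. Inputs are per-cell
    arrays; `observed_sign` is boolean. Illustrative stand-in for HiBRID."""
    p_obs = attack_prob * detect_prob
    return np.where(observed_sign, p_obs, 1.0 - p_obs)
```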
Regret-Based Optimization and Preference Elicitation for Stackelberg Security Games with Uncertainty
Nguyen, Thanh Hong (University of Southern California) | Yadav, Amulya (University of Southern California) | An, Bo (Nanyang Technological University) | Tambe, Milind (University of Southern California) | Boutilier, Craig (University of Toronto)
Stackelberg security games (SSGs) have been deployed in a number of real-world domains. One key challenge in these applications is the assessment of attacker payoffs, which may not be perfectly known. Previous work has studied SSGs with uncertain payoffs modeled by interval uncertainty and provided maximin-based robust solutions. In contrast, in this work we propose the use of the less conservative minimax regret decision criterion for such payoff-uncertain SSGs and present the first algorithms for computing minimax regret for SSGs. We also address the challenge of preference elicitation, using minimax regret to develop the first elicitation strategies for SSGs. Experimental results validate the effectiveness of our approaches.
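The minimax regret criterion itself is easy to state in code: the regret of a strategy under one payoff realization is its gap to the best strategy for that realization, and minimax regret minimizes the worst such gap over the uncertainty set. The brute-force sketch below works over a finite set of candidate strategies and sampled payoff realizations; it is only meant to show the criterion, whereas the paper's algorithms compute minimax regret over interval uncertainty without enumeration.

```python
import numpy as np

def defender_utility(x, R_d, P_d, R_a, P_a):
    """Defender utility when the attacker best-responds to coverage x (per-target payoffs)."""
    t = int(np.argmax((1.0 - x) * R_a + x * P_a))    # attacked target
    return float(x[t] * R_d[t] + (1.0 - x[t]) * P_d[t])

def max_regret(x, candidates, payoff_samples):
    """Worst-case regret of x over sampled payoff realizations (brute force)."""
    regret = 0.0
    for R_d, P_d, R_a, P_a in payoff_samples:
        best = max(defender_utility(c, R_d, P_d, R_a, P_a) for c in candidates)
        regret = max(regret, best - defender_utility(x, R_d, P_d, R_a, P_a))
    return regret

def minimax_regret_strategy(candidates, payoff_samples):
    """Pick the candidate strategy minimizing its maximum regret."""
    return min(candidates, key=lambda x: max_regret(x, candidates, payoff_samples))
```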
Analyzing the Effectiveness of Adversary Modeling in Security Games
Nguyen, Thanh Hong (University of Southern California) | Yang, Rong (University of Southern California) | Azaria, Amos (Bar-Ilan University) | Kraus, Sarit (Bar-Ilan University and University of Maryland) | Tambe, Milind (University of Southern California)
Recent deployments of Stackelberg security games (SSG) have led to two competing approaches to handling boundedly rational human adversaries: (1) integrating models of human (adversary) decision-making into the game-theoretic algorithms, and (2) applying robust optimization techniques that avoid adversary modeling. A recent algorithm (MATCH) based on the second approach was shown to outperform the leading modeling-based algorithm even in the presence of a significant amount of data. Is there then any value in using human behavior models in solving SSGs? Through extensive experiments with 547 human subjects playing 11,102 games in total, we emphatically answer the question in the affirmative, while providing the following key contributions: (i) we show that our algorithm, SU-BRQR, based on a novel integration of a human behavior model with a subjective utility function, significantly outperforms both MATCH and its improvements; (ii) we are the first to present experimental results with security intelligence experts, and find that even though the experts are more rational than the Amazon Turk workers, SU-BRQR still outperforms an approach assuming perfect rationality (and, to a more limited extent, MATCH); (iii) we show the advantage of SU-BRQR in a new, large game setting and demonstrate that sufficient data enables it to improve its performance over MATCH.
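The subjective utility quantal response model underlying SU-BRQR replaces expected utility with a weighted sum of target features (coverage probability, attacker reward, attacker penalty) inside a logit choice rule. The sketch below shows only that choice rule; the feature weights would be fit to human-subject data, and the variable names are illustrative.

```python
import numpy as np

def su_qr_attack_probs(x, R_a, P_a, w):
    """SUQR-style attack distribution: P(attack t) is proportional to
    exp(w1 * x_t + w2 * R_a_t + w3 * P_a_t), with w = (w1, w2, w3)
    typically learned from human-subject data."""
    su = w[0] * x + w[1] * R_a + w[2] * P_a
    e = np.exp(su - su.max())               # numerically stable softmax
    return e / e.sum()
```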