Goto

Collaborating Authors

 bowling


Deep (Predictive) Discounted Counterfactual Regret Minimization

Xu, Hang, Li, Kai, Fu, Haobo, Fu, Qiang, Xing, Junliang, Cheng, Jian

arXiv.org Artificial Intelligence

Counterfactual regret minimization (CFR) is a family of algorithms for effectively solving imperfect-information games. To enhance CFR's applicability in large games, researchers use neural networks to approximate its behavior. However, existing methods are mainly based on vanilla CFR and struggle to effectively integrate more advanced CFR variants. In this work, we propose an efficient model-free neural CFR algorithm, overcoming the limitations of existing methods in approximating advanced CFR variants. At each iteration, it collects variance-reduced sampled advantages based on a value network, fits cumulative advantages by bootstrapping, and applies discounting and clipping operations to simulate the update mechanisms of advanced CFR variants. Experimental results show that, compared with model-free neural algorithms, it exhibits faster convergence in typical imperfect-information games and demonstrates stronger adversarial performance in a large poker game.


The Age of Goggles Has Arrived

The Atlantic - Technology

"Vision Pro feels familiar, yet it's entirely new." That's how Apple's CEO, Tim Cook, introduced the company's new computer goggles at the Apple Worldwide Developers Conference on Monday. The Vision Pro headset, which resembles a glass scuba mask with a fabric head strap, seamlessly blends the real and digital worlds, Cook said. But the product's name, which could just as easily describe a brand of contact-lens solution, hints at a challenge. Familiar yet entirely new, natural but augmented: If goggles really are the future of computing, they will have to overcome a bevy of conflicting sentiments. As you might expect, Apple's product is slick.


'Nintendo Switch Sports' is a grab bag of great games and duds

Washington Post - Technology News

Bowling, by contrast, is no challenge at all. Twice in a row, while demonstrating to a friend how to throw the ball -- half looking at him, half at the TV -- I landed strikes. Bowling is about the boom and bust of motion control randomness. It is fun, it turns out, to be marginally, accidentally better than someone! It's not a good game, but it's a recipe for a good hang.


Fast Algorithms for Poker Require Modelling it as a Sequential Bayesian Game

Kovařík, Vojtěch, Milec, David, Šustr, Michal, Seitz, Dominik, Lisý, Viliam

arXiv.org Artificial Intelligence

Many recent results in imperfect information games were only formulated for, or evaluated on, poker and poker-like games such as liar's dice. We argue that sequential Bayesian games constitute a natural class of games for generalizing these results. In particular, this model allows for an elegant formulation of the counterfactual regret minimization algorithm, called public-state CFR (PS-CFR), which naturally lends itself to an efficient implementation. Empirically, solving a poker subgame with 10^7 states by public-state CFR takes 3 minutes and 700 MB while a comparable version of vanilla CFR takes 5.5 hours and 20 GB. Additionally, the public-state formulation of CFR opens up the possibility for exploiting domain-specific assumptions, leading to a quadratic reduction in asymptotic complexity (and a further empirical speedup) over vanilla CFR in poker and other domains. Overall, this suggests that the ability to represent poker as a sequential Bayesian game played a key role in the success of CFR-based methods. Finally, we extend public-state CFR to general extensive-form games, arguing that this extension enjoys some - but not all - of the benefits of the version for sequential Bayesian games.


The Partially Observable History Process

Morrill, Dustin, Greenwald, Amy R., Bowling, Michael

arXiv.org Artificial Intelligence

We introduce the partially observable history process (POHP) formalism for reinforcement learning. POHP centers around the actions and observations of a single agent and abstracts away the presence of other players without reducing them to stochastic processes. Our formalism provides a streamlined interface for designing algorithms that defy categorization as exclusively single or multi-agent, and for developing theory that applies across these domains. We show how the POHP formalism unifies traditional models including the Markov decision process, the Markov game, the extensive-form game, and their partially observable extensions, without introducing burdensome technical machinery or violating the philosophical underpinnings of reinforcement learning. We illustrate the utility of our formalism by concisely exploring observable sequential rationality, re-deriving the extensive-form regret minimization (EFR) algorithm, and examining EFR's theoretical properties in greater generality.


Shielding Atari Games with Bounded Prescience

Giacobbe, Mirco, Hasanbeig, Mohammadhosein, Kroening, Daniel, Wijk, Hjalmar

arXiv.org Artificial Intelligence

Deep reinforcement learning (DRL) is applied in safety-critical domains such as robotics and autonomous driving. It achieves superhuman abilities in many tasks, however whether DRL agents can be shown to act safely is an open problem. Atari games are a simple yet challenging exemplar for evaluating the safety of DRL agents and feature a diverse portfolio of game mechanics. The safety of neural agents has been studied before using methods that either require a model of the system dynamics or an abstraction; unfortunately, these are unsuitable to Atari games because their low-level dynamics are complex and hidden inside their emulator. We present the first exact method for analysing and ensuring the safety of DRL agents for Atari games. Our method only requires access to the emulator. First, we give a set of 43 properties that characterise "safe behaviour" for 30 games. Second, we develop a method for exploring all traces induced by an agent and a game and consider a variety of sources of game non-determinism. We observe that the best available DRL agents reliably satisfy only very few properties; several critical properties are violated by all agents. Finally, we propose a countermeasure that combines a bounded explicit-state exploration with shielding. We demonstrate that our method improves the safety of all agents over multiple properties.


Understanding Finite-State Representations of Recurrent Policy Networks

Danesh, Mohamad H., Koul, Anurag, Fern, Alan, Khorram, Saeed

arXiv.org Machine Learning

We introduce an approach for understanding finite-state machine (FSM) representations of recurrent policy networks. Recent work focused on minimizing FSMs to gain high-level insight, however, minimization can obscure a deeper understanding by merging states that are semantically distinct. Conversely, our approach starts with an unminimized machine and applies more-interpretable reductions that preserve the key decision points of the policy. We also contribute a saliency tool to attain a deeper understanding of the role of observations in the decisions. Our case studies on policies from 7 Atari games and 3 control benchmarks demonstrate that the approach can reveal insights that have not been noticed in prior work.


FRESH: Interactive Reward Shaping in High-Dimensional State Spaces using Human Feedback

Xiao, Baicen, Lu, Qifan, Ramasubramanian, Bhaskar, Clark, Andrew, Bushnell, Linda, Poovendran, Radha

arXiv.org Artificial Intelligence

Reinforcement learning has been successful in training autonomous agents to accomplish goals in complex environments. Although this has been adapted to multiple settings, including robotics and computer games, human players often find it easier to obtain higher rewards in some environments than reinforcement learning algorithms. This is especially true of high-dimensional state spaces where the reward obtained by the agent is sparse or extremely delayed. In this paper, we seek to effectively integrate feedback signals supplied by a human operator with deep reinforcement learning algorithms in high-dimensional state spaces. We call this FRESH (Feedback-based REward SHaping). During training, a human operator is presented with trajectories from a replay buffer and then provides feedback on states and actions in the trajectory. In order to generalize feedback signals provided by the human operator to previously unseen states and actions at test-time, we use a feedback neural network. We use an ensemble of neural networks with a shared network architecture to represent model uncertainty and the confidence of the neural network in its output. The output of the feedback neural network is converted to a shaping reward that is augmented to the reward provided by the environment. We evaluate our approach on the Bowling and Skiing Atari games in the arcade learning environment. Although human experts have been able to achieve high scores in these environments, state-of-the-art deep learning algorithms perform poorly. We observe that FRESH is able to achieve much higher scores than state-of-the-art deep learning algorithms in both environments. FRESH also achieves a 21.4% higher score than a human expert in Bowling and does as well as a human expert in Skiing.


Single Deep Counterfactual Regret Minimization

Steinberger, Eric

arXiv.org Artificial Intelligence

Counterfactual Regret Minimization (CFR) is the most successful algorithm for finding approximate Nash equilibria in imperfect information games. However, CFR's reliance on full game-tree traversals limits its scalability. For this reason, the game's state- and action-space is often abstracted (i.e. simplified) for CFR, and the resulting strategy is then translated back to the full game, which requires extensive expert-knowledge and often converges to highly exploitable policies. A recently proposed method, Deep CFR, applies deep learning directly to CFR, allowing the agent to intrinsically abstract and generalize over the state-space from samples, without requiring expert knowledge. In this paper, we introduce Single Deep CFR (SD-CFR), a simplified variant of Deep CFR that has a lower overall approximation error by avoiding the training of an average strategy network. We show that SD-CFR is more attractive from a theoretical perspective and empirically outperforms Deep CFR with respect to exploitability and one-on-one play in poker.


Revisiting CFR+ and Alternating Updates

Burch, Neil, Moravcik, Matej, Schmid, Martin

Journal of Artificial Intelligence Research

The CFR+ algorithm for solving imperfect information games is a variant of the popular CFR algorithm, with faster empirical performance on a range of problems. It was introduced with a theoretical upper bound on solution error, but subsequent work showed an error in one step of the proof. We provide updated proofs to recover the original bound.