Partially observable Markov decision processes (POMDPs) provide a principled framework for sequential planning in uncertain single-agent settings. An extension of POMDPs to multiagent settings, called interactive POMDPs (I-POMDPs), replaces POMDP belief spaces with interactive hierarchical belief systems, which represent an agent's belief about the physical world, about the beliefs of other agents, and about their beliefs about others' beliefs. This modification makes the difficulties of obtaining solutions due to the complexity of the belief and policy spaces even more acute. We describe a general method for obtaining approximate solutions of I-POMDPs based on particle filtering (PF). We introduce the interactive PF, which descends the levels of the interactive belief hierarchies and samples and propagates beliefs at each level. The interactive PF is able to mitigate the belief space complexity, but it does not address the policy space complexity. To mitigate the policy space complexity -- sometimes also called the curse of history -- we utilize a complementary method based on sampling likely observations while building the look-ahead reachability tree. While this approach does not completely address the curse of history, it beats back the curse's impact substantially. We provide experimental results and chart future work.
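The sample-propagate-weight-resample cycle underlying the interactive PF can be illustrated with an ordinary single-agent bootstrap particle filter over physical states; the interactive version extends this by nesting filters over the other agents' beliefs at each level of the hierarchy. A minimal sketch, with all function names (`transition_sample`, `observation_prob`) being illustrative assumptions rather than any library's API:

```python
import random

def particle_filter_step(particles, action, observation,
                         transition_sample, observation_prob):
    """One bootstrap particle-filter belief update.

    particles          -- list of sampled states approximating the belief
    transition_sample  -- draws a successor state given (state, action)
    observation_prob   -- likelihood of the observation in a successor state
    """
    # Propagate: sample a successor state for each particle.
    propagated = [transition_sample(s, action) for s in particles]
    # Weight: likelihood of the received observation at each successor.
    weights = [observation_prob(observation, s, action) for s in propagated]
    if sum(weights) == 0:
        # Observation impossible under every particle: keep the
        # unweighted propagated set rather than dividing by zero.
        return propagated
    # Resample proportionally to the weights.
    return random.choices(propagated, weights=weights, k=len(particles))
```

In the interactive PF, each particle would additionally carry a model of the other agent (including that agent's own particle set), and propagation would involve solving or sampling that agent's predicted action.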
One issue might be that many people have moved to ALE & OpenAI's Gym interface for API/environment implementations, and Python for implementation language. Your C library makes Python sound like a very second-class citizen, which is discouraging, and C is increasingly disfavored for its complexity & low-level nature. Just to get started with this, one has to learn the 'Cassandra POMDP format', whatever that is, and then deal with C rather than Python. Are there that many people who want to solve MDPs in tabular form whose preferred language is C and who love defining their models in Cassandra POMDP format? You also don't have any impressive use-cases or demos of things which one can do easily in AIToolbox which can't be done elsewhere as easily, or as fast, or at all - what gives me any confidence that this is really mature and I won't simply invest days into learning it only to discover some severe limitation which makes it useless for me?
We introduce a natural extension of the notion of swap regret, conditional swap regret, that allows for action modifications conditioned on the player's action history. We prove a series of new results for conditional swap regret minimization. We further extend these results to the case where conditional swaps are considered only for a subset of actions. We also define a new notion of equilibrium, conditional correlated equilibrium, that is tightly connected to the notion of conditional swap regret: when all players follow conditional swap regret minimization strategies, the empirical distribution of play approaches this equilibrium. Finally, we extend our results to the multi-armed bandit scenario.
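For intuition, ordinary (unconditional) swap regret compares the loss actually incurred against the loss of the best action-modification function applied in hindsight; since a swap function maps each action independently, the best swap can be found per action. A minimal sketch of computing empirical swap regret from a play history (illustrative only, not the paper's algorithm, which concerns the harder conditional variant):

```python
def swap_regret(actions_played, loss_matrix):
    """Empirical swap regret of a finite play sequence.

    actions_played -- list of action indices, one per round
    loss_matrix    -- loss_matrix[t][a] is the loss action a would
                      have incurred at round t
    """
    n_actions = len(loss_matrix[0])
    # Loss actually incurred by the played sequence.
    incurred = sum(loss_matrix[t][a] for t, a in enumerate(actions_played))
    # The best swap function phi: A -> A decomposes per action:
    # for each action a, pick the replacement minimizing cumulative
    # loss over exactly the rounds where a was played.
    swapped = 0.0
    for a in range(n_actions):
        rounds = [t for t, x in enumerate(actions_played) if x == a]
        if rounds:
            swapped += min(sum(loss_matrix[t][b] for t in rounds)
                           for b in range(n_actions))
    return incurred - swapped
```

Conditional swap regret generalizes this by letting the replacement for an action depend on the recent action history, so the benchmark class of modification functions is strictly richer.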
Point-based algorithms and RTDP-Bel are approximate methods for solving POMDPs that replace the full updates of parallel value iteration by faster and more effective updates at selected beliefs. An important difference between the two methods is that the former adopt Sondik's representation of the value function, while the latter uses a tabular representation and a discretization function. The algorithms, however, have not been compared up to now, because they target different POMDPs: discounted POMDPs on the one hand, and Goal POMDPs on the other. In this paper, we bridge this representational gap, showing how to transform discounted POMDPs into Goal POMDPs, and use the transformation to compare RTDP-Bel with point-based algorithms over the existing discounted benchmarks. The results appear to contradict the conventional wisdom in the area, showing that RTDP-Bel is competitive with, and sometimes superior to, point-based algorithms in both quality and time.
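A standard way to fold a discount factor into the dynamics, which conveys the flavor of such a transformation, is to scale every transition probability by gamma and route the remaining 1 - gamma mass to a new absorbing, cost-free goal state. The sketch below applies this idea to a tabular transition model (a simplified, assumed encoding; the paper's construction operates on full POMDPs with observations):

```python
def discount_to_goal(P, gamma):
    """Turn a discounted transition model into a goal-oriented one.

    P[a][s] is a dict {s_next: prob} over successor states.
    Returns a model over the original states plus an absorbing
    state 'GOAL' reached with probability 1 - gamma at every step,
    so expected undiscounted cost-to-goal in the new model matches
    expected discounted cost in the original.
    """
    Q = {}
    for a, rows in P.items():
        Q[a] = {}
        for s, dist in rows.items():
            # Scale the original transitions by gamma ...
            row = {s2: gamma * p for s2, p in dist.items()}
            # ... and send the leftover mass to the goal state.
            row['GOAL'] = row.get('GOAL', 0.0) + (1.0 - gamma)
            Q[a][s] = row
        Q[a]['GOAL'] = {'GOAL': 1.0}  # the goal is absorbing
    return Q
```

Because a trajectory survives each step only with probability gamma, the probability of still being "alive" at step t is gamma**t, which reproduces the discounting of step-t costs.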