Markov Models
Review for NeurIPS paper: Calibration of Shared Equilibria in General Sum Partially Observable Markov Games
The paper was refereed by 4 knowledgeable reviewers. All reviewers appreciated the contributions of the paper:
- Formalization of self-play and a formal proof of when it is guaranteed to converge.
- A new algorithm for calibrating equilibria that is more effective than a naive use of BO.
- Convincing results on a market agent scenario.
The biggest concern discussed among the reviewers was the extended transitivity assumption. While the rebuttal partially addressed it, the authors should add a longer discussion to the paper of the games for which this assumption holds. After the discussion, however, all reviewers agreed that the paper merits acceptance, and I join this decision.
Exploratory Mean-Variance Portfolio Optimization with Regime-Switching Market Dynamics
Chen, Yuling Max, Li, Bin, Saunders, David
Considering the continuous-time Mean-Variance (MV) portfolio optimization problem, we study a regime-switching market setting and apply reinforcement learning (RL) techniques to assist informed exploration within the control space. We introduce and solve the Exploratory Mean Variance with Regime Switching (EMVRS) problem. We also present a Policy Improvement Theorem. Further, we recognize that the widely applied Temporal Difference (TD) learning is not adequate for the EMVRS context, hence we consider Orthogonality Condition (OC) learning, leveraging the martingale property of the induced optimal value function from the analytical solution to EMVRS. We design an RL algorithm that has a more meaningful parameterization using the market parameters and propose an updating scheme for each parameter. Our empirical results demonstrate the superiority of OC learning over TD learning, with clear convergence of the market parameters towards their corresponding "ground truth" values in a simulated market scenario. In a real market data study, EMVRS with OC learning outperforms its counterparts, with the highest mean and reasonably low volatility of the annualized portfolio returns.
Inverse Reinforcement Learning via Convex Optimization
Zhu, Hao, Zhang, Yuan, Boedecker, Joschka
We consider the inverse reinforcement learning (IRL) problem, where an unknown reward function of some Markov decision process is estimated based on observed expert demonstrations. In most existing approaches, IRL is formulated and solved as a nonconvex optimization problem, posing challenges in scenarios where robustness and reproducibility are critical. We discuss a convex formulation of the IRL problem (CIRL) initially proposed by Ng and Russell, and reformulate the problem such that the domain-specific language CVXPY can be applied directly to specify and solve the convex problem. We also extend the CIRL problem to scenarios where the expert policy is not given analytically but as trajectories of state-action pairs, which can be strongly inconsistent with optimality, by augmenting some of the constraints. Theoretical analysis and a practical implementation of hyperparameter auto-selection are introduced. This note helps users apply CIRL to their problems easily, without background knowledge of convex optimization.
Review for NeurIPS paper: POMDPs in Continuous Time and Discrete Spaces
Summary and Contributions: Post-rebuttal: I would like to thank the authors for the thoughtful response. The main issue for me was clarity, and I'm happy that the authors agreed to improve this aspect of the paper. However, it's hard to increase my score based on this promise alone. Nevertheless, my recommendation should really be considered a borderline recommendation. I will not fight against accepting this paper. This involves both filtering and control.
Review for NeurIPS paper: POMDPs in Continuous Time and Discrete Spaces
The paper describes new offline and online techniques to optimize the policy of continuous-time POMDPs with discrete states and actions. This paper makes an important contribution to the RL and control literature. Very little work in the ML community has focused on continuous-time control problems. While the techniques assume that the model is known, do not scale to high-dimensional problems, and were tested only on toy problems, they introduce new formalisms that will help the community become familiar with the mathematics of continuous-time control. Hence this paper will be of high interest to the RL community.
Reviews: Estimating Convergence of Markov chains with L-Lag Couplings
The authors generalize 1-lag coupling of the chains to L-lag coupling and provide upper bounds on some distribution distances, including the total variation and 1-Wasserstein distances. This bound serves as a convergence check for MCMC, e.g., to stop the burn-in phase. The main contributions of the paper are 1) deriving a computable bound on the distribution distance between two (L-lagged) chains, and 2) presenting algorithms (e.g., Coupled Random-Walk Metropolis-Hastings, Coupled HMC, etc.) that use the bound as a stopping criterion for burn-in. Unfortunately, the second part, together with the proof of the bound, is in the supplementary material. The presented bound and the method to compute it are, to the best of my knowledge, novel and significantly extend the state of the art.
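To make the idea of a computable bound concrete, here is a minimal sketch of how a total variation bound could be estimated from simulated meeting times. It assumes the bound has the form E[max(0, ceil((tau - L - t) / L))], where tau is the meeting time of a pair of chains coupled with lag L. That form is not stated in the review and is an assumption here, as are the function name `tv_upper_bound` and the meeting times, which are made up for illustration.

```python
import math


def tv_upper_bound(meeting_times, lag, t):
    """Monte Carlo estimate of E[max(0, ceil((tau - lag - t) / lag))].

    meeting_times: meeting times tau from independent runs of an
        L-lag coupled chain (hypothetical input, one value per run).
    lag: the lag L used when coupling the two chains.
    t: the iteration at which the distance to stationarity is bounded.
    """
    terms = [max(0, math.ceil((tau - lag - t) / lag)) for tau in meeting_times]
    return sum(terms) / len(terms)


# Hypothetical meeting times from five coupled runs with lag 5.
meeting_times = [12, 15, 9, 30, 18]
lag = 5

# Evaluate the estimated bound along the chain; a burn-in rule could stop
# at the first t where the estimate drops below a chosen tolerance.
bounds = [tv_upper_bound(meeting_times, lag, t) for t in range(30)]
burn_in = next(t for t, b in enumerate(bounds) if b < 0.25)
```

The estimate can exceed 1 for small t, where it is uninformative as a total variation bound; it becomes useful once it falls below 1 and decays towards 0.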
Reviews: Estimating Convergence of Markov chains with L-Lag Couplings
After discussion, all agree that this paper makes a significant contribution and merits acceptance. These results on estimating MCMC convergence with L-lag couplings will be of broad interest to the NeurIPS community. Please take the reviewers' constructive feedback into account and follow through on your promises to improve the paper as stated in the rebuttal.
Review for NeurIPS paper: Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits
I must first admit that judging this paper was a fairly challenging task given the mixed opinions expressed by the reviewers, together with my own impressions after having scrutinized the manuscript in detail. The reviewers largely agree that the paper deserves credit as it tackles the challenging, relevant and (relatively) scarcely studied topic of restless bandit learning. I believe the main value of the paper is in the introduction of the birth-death Markov chain structure for arms of a restless bandit, together with the monotonicity and positive correlation assumptions on rewards and transitions. These are not unnatural assumptions, as evidenced by modeling literature on scheduling over wireless channels and queueing systems, and seem to greatly alleviate the computational complexity of a portion of the learning process. On the other hand, the reviewers are not fully convinced about the significance of the proposed algorithm and regret bound proven in the paper, given that the analysis is carried out for a highly structured ensemble of Markov decision processes.
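As a small illustration of why the birth-death structure is computationally convenient, the sketch below computes the stationary distribution of a finite birth-death chain from the detailed balance relation pi[i] * up[i] = pi[i + 1] * down[i]. This is a standard property of birth-death chains, not the algorithm of the reviewed paper; the function name and the transition probabilities are hypothetical.

```python
def birth_death_stationary(up, down):
    """Stationary distribution of a finite birth-death Markov chain.

    up[i]:   probability of moving from state i to state i + 1.
    down[i]: probability of moving from state i + 1 to state i.
    Both lists have length n - 1 for a chain on states 0..n-1, and all
    entries are assumed to be strictly positive.
    """
    weights = [1.0]
    for p, q in zip(up, down):
        # Detailed balance: pi[i] * p = pi[i + 1] * q.
        weights.append(weights[-1] * p / q)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical three-state chain that drifts towards the highest state.
pi = birth_death_stationary(up=[0.5, 0.5], down=[0.25, 0.25])
```

Because each state only communicates with its neighbours, the distribution follows from a single pass over the states rather than from solving a general linear system.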
Review for NeurIPS paper: Instance-based Generalization in Reinforcement Learning
Weaknesses: The paper lacks many intricate details, which prevents the reader from judging the novelty and full contribution of the work. After reading the rebuttal, I still think an overview of the proposed solution and the problem setting would be of much help to readers. Is the entire game (with all levels) considered as a POMDP? I see sentences such as "Line 62: environment is considered as a markov process". How is the generalization problem being modelled?
Review for NeurIPS paper: Belief-Dependent Macro-Action Discovery in POMDPs using the Value of Information
Weaknesses: The work is not well presented. Terms like open-loop actions, closed-loop policies, and reachable belief space are used without definitions. As a result, the reviewer had difficulty understanding Figures 1 and 2. Value of information is the key to this work, but it is only briefly discussed in Section 4.1. The major concern is the evaluation of the developed methods. The POMDP community has provided a number of benchmark problems.