Reviews: MAVEN: Multi-Agent Variational Exploration

Neural Information Processing Systems 

The StarCraft results also seem fine, but they are not strong enough to make it obvious that committed exploration is a crucial empirical improvement for QMIX: while MAVEN agents learn faster on 3s5z, the final performance looks the same; MAVEN agents seem to have less variability in final win rate on 5m_vs_6m; and QMIX actually seems to have better final performance on 10m_vs_11m. The results in Figures 2 and 4 do, however, suggest that there may be scenarios where the advantage of MAVEN is larger.

Minor comments:
1) line 64 and others: the subscript "qmix" should probably be wrapped in a "\text{}"
2) first eqn in section 3: inconsistency between using subscripts and superscripts, i.e. u_i and u^i
3) line 81: perhaps better phrased as: "the *best* action of agent i..."
4) line 86: u_n i - u_ U i?
5) line 87: I was confused by what "the set of all possible such orderings over the action-values" means. Apart from the degenerate case where some of the Q-values are identical, isn't there only one valid ordering? Or are you just trying to cover that degeneracy?
6) Definition 1: perhaps add an intuitive explanation, e.g. "Intuitively, a Q-function is non-monotonic if the ordering of best actions for agent i can be affected by the other agents' action choices at that time step."
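
To make the intuition suggested for Definition 1 concrete, here is a minimal sketch of the kind of check I have in mind. It is not taken from the paper: the function names and the two payoff tables are hypothetical, it covers only a two-agent, single-step game, and it breaks ties by action index, which sidesteps the degeneracy raised in the comment on line 87.

```python
def action_ordering(values):
    """Return action indices sorted from best to worst.

    Ties are broken by action index (an assumption made here for simplicity).
    """
    return tuple(sorted(range(len(values)), key=lambda a: (-values[a], a)))


def is_nonmonotonic_for_agent0(q):
    """Check whether agent 0's action ordering depends on agent 1's action.

    q[a0][a1] is a hypothetical joint action-value table for two agents.
    """
    n_actions_0 = len(q)
    n_actions_1 = len(q[0])
    orderings = {
        action_ordering([q[a0][a1] for a0 in range(n_actions_0)])
        for a1 in range(n_actions_1)
    }
    # More than one distinct ordering means the other agent's choice
    # changes how agent 0 ranks its own actions.
    return len(orderings) > 1


# Hypothetical payoff tables, chosen only for illustration.
coordination = [[1.0, 0.0], [0.0, 1.0]]  # best action flips with a1
additive = [[0.0, 1.0], [2.0, 3.0]]      # action 1 is always best

print(is_nonmonotonic_for_agent0(coordination))  # True
print(is_nonmonotonic_for_agent0(additive))      # False
```

A two-by-two example of this kind next to Definition 1 would let the reader see at a glance which payoff structures the definition is meant to capture.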