Why do policy gradient methods work so well in cooperative MARL? Evidence from policy representation