1663fba7b56da1e96bed6e30546a07b0-Supplemental-Conference.pdf
Neural Information Processing Systems
Thus, the assumption that the policy is conditionally independent of $z_\omega$ given $z^i_\alpha$ corresponds well to the assumption in MARL that agents use only local information (rather than joint information) to inform their policy and decision-making. Note that we found that cyclically annealing [82] the $\beta$ term in our variational lower bound from 0 to the values specified in Table 5 helps avoid KL vanishing.

A.2.4 Computational Details

For MARL trajectory data generation, we used an internal CPU cluster for both the 3-agent hillclimbing and 2-agent coordination domains, and used TPUs only for the multiagent MuJoCo data generation.

Given a characteristic of interest (e.g., the level of dispersion of agents), we define a training set consisting of joint latents $z_\omega$ and class labels $y$ (e.g., classes corresponding to different intervals of team returns). Using these definitions, we can gauge the representational power of $z_\omega$ by learning a mapping $g: \hat{\nu}_c(z_\omega) \to y$. In practice, $g$ is a simple model (e.g., a shallow network or linear projection), so that its performance reflects the expressivity of the latent space rather than the capacity of $g$.
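The cyclical annealing of the β term mentioned above can be illustrated with a minimal sketch. It assumes a linear ramp from 0 to the target value within each cycle, followed by a hold at the target. The function name `cyclical_beta`, the number of cycles, and the ramp fraction are hypothetical choices made for illustration; the text only states that β is annealed cyclically from 0 to the values in Table 5.

```python
def cyclical_beta(step, total_steps, beta_max, num_cycles=4, ramp_fraction=0.5):
    """Return the KL weight at `step` under a cyclical linear-ramp schedule.

    Assumptions (not taken from the source): `num_cycles` equal-length cycles,
    each ramping linearly from 0 to `beta_max` over the first `ramp_fraction`
    of the cycle and holding `beta_max` for the remainder.
    """
    if total_steps <= 0 or num_cycles <= 0:
        raise ValueError("total_steps and num_cycles must be positive")
    if not 0.0 < ramp_fraction <= 1.0:
        raise ValueError("ramp_fraction must be in (0, 1]")

    cycle_length = total_steps / num_cycles
    # Relative position inside the current cycle, in [0, 1).
    position = (step % cycle_length) / cycle_length
    # Linear ramp up to beta_max, then hold until the cycle restarts.
    return beta_max * min(1.0, position / ramp_fraction)


# Example: the KL weight over a short (hypothetical) 1000-step training run.
schedule = [cyclical_beta(s, total_steps=1000, beta_max=1.0) for s in range(1000)]
```

Restarting the ramp at 0 at the start of each cycle periodically lets the decoder train with little KL pressure, which is the usual intuition for why such schedules mitigate KL vanishing. In training, the returned value would multiply the KL term of the variational lower bound at each step.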
Feb-7-2026, 15:35:17 GMT