Multi-Head Mixture-of-Experts Xun Wu, Shaohan Huang, Wenhui Wang, Shuming Ma, Li Dong, Furu Wei Microsoft Research Asia
However, Sparse Mixture-of-Experts (SMoE) exhibits the low expert activation issue: only a small subset of experts is activated for optimization, leading to suboptimal performance and limiting its effectiveness in learning a larger number of experts for complex tasks. In this paper, we propose Multi-Head Mixture-of-Experts (MH-MoE). MH-MoE splits each input token into multiple sub-tokens, assigns these sub-tokens to a diverse set of experts that process them in parallel, and then seamlessly reintegrates them into the original token form. These operations enable MH-MoE to significantly enhance expert activation while collectively attending to information from various representation spaces within different experts, deepening context understanding. Moreover, MH-MoE is straightforward to implement and is decoupled from other SMoE frameworks, making it easy to integrate with them for enhanced performance. Extensive experimental results across different parameter scales (300M to 7B) and three pre-training tasks (English-focused language modeling, multi-lingual language modeling, and masked multi-modality modeling), along with multiple downstream validation tasks, demonstrate the effectiveness of MH-MoE.
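To make the split-route-merge idea concrete, below is a minimal sketch of a multi-head MoE layer, assuming a generic top-1 router and simple feed-forward experts; the hyperparameter names (d_model, num_heads, num_experts) and the projection layers are illustrative choices, not the paper's implementation.

```python
# Minimal sketch of the multi-head split -> route -> merge idea (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MHMoESketch(nn.Module):
    def __init__(self, d_model=512, num_heads=4, num_experts=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_sub = d_model // num_heads               # dimension of each sub-token
        self.split_proj = nn.Linear(d_model, d_model)   # projection applied before splitting
        self.merge_proj = nn.Linear(d_model, d_model)   # projection applied after re-assembly
        self.router = nn.Linear(self.d_sub, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(self.d_sub, 4 * self.d_sub),
                          nn.GELU(),
                          nn.Linear(4 * self.d_sub, self.d_sub))
            for _ in range(num_experts)
        ])

    def forward(self, x):                               # x: (batch, seq, d_model)
        b, s, d = x.shape
        # 1) Split each token into `num_heads` sub-tokens.
        sub = self.split_proj(x).reshape(b * s * self.num_heads, self.d_sub)
        # 2) Route every sub-token independently to its top-1 expert.
        gate = F.softmax(self.router(sub), dim=-1)      # (N, num_experts)
        weight, expert_idx = gate.max(dim=-1)           # top-1 choice per sub-token
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(sub[mask])
        # 3) Merge the processed sub-tokens back into the original token shape.
        merged = out.reshape(b, s, self.num_heads * self.d_sub)
        return self.merge_proj(merged)


if __name__ == "__main__":
    layer = MHMoESketch()
    tokens = torch.randn(2, 16, 512)
    print(layer(tokens).shape)   # torch.Size([2, 16, 512])
```

Because routing happens at the sub-token level, a single input token can reach several different experts in one pass, which is the mechanism behind the higher expert activation described above.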
Policy Space Diversity for Non-Transitive Games
Policy-Space Response Oracles (PSRO) is an influential algorithmic framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have tried to promote policy diversity in PSRO. A major weakness of existing diversity metrics is that a more diverse population (according to those metrics) does not necessarily yield, as we prove in this paper, a better approximation to an NE. To alleviate this problem, we propose a new diversity metric whose improvement guarantees a better approximation to an NE. We also develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best-response solving in PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on various games demonstrate that PSD-PSRO is more effective than state-of-the-art PSRO variants at producing significantly less exploitable policies.
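For orientation, here is a schematic PSRO-style loop on a two-player zero-sum matrix game with a diversity bonus added to the best-response objective. The uniform meta-strategy, the novelty-based bonus, and the weight `lam` are simplifying placeholders for illustration only; they are not the PSD-PSRO metric or its sample-based optimization.

```python
# Schematic PSRO-style loop with a diversity-regularized best response (illustrative only).
import numpy as np


def psro_with_diversity(payoff, iterations=20, lam=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = payoff.shape[0]                      # number of pure strategies
    population = [int(rng.integers(n))]      # start from one random pure strategy

    for _ in range(iterations):
        # Uniform meta-strategy over the population (a stand-in for a Nash meta-solver).
        meta = np.full(len(population), 1.0 / len(population))
        opponent_mix = np.zeros(n)
        for prob, s in zip(meta, population):
            opponent_mix[s] += prob
        # Expected payoff of each candidate strategy against the opponent's meta-mixture.
        exploit_value = payoff @ opponent_mix
        # Placeholder diversity bonus: reward strategies whose payoff vectors differ
        # from those already in the population (NOT the PSD-PSRO diversity metric).
        pop_vectors = payoff[population]                                  # (|pop|, n)
        novelty = np.min(
            np.linalg.norm(payoff[:, None, :] - pop_vectors[None, :, :], axis=-1), axis=1
        )
        best_response = int(np.argmax(exploit_value + lam * novelty))
        if best_response not in population:
            population.append(best_response)
    return population


if __name__ == "__main__":
    # Rock-paper-scissors (row player's payoff), a classic non-transitive game.
    rps = np.array([[0., -1., 1.],
                    [1., 0., -1.],
                    [-1., 1., 0.]])
    print(psro_with_diversity(rps))
```

The point of the sketch is the structure: each iteration solves a regularized best response against the current meta-strategy, so the diversity term only shapes which new policy enters the population, not the meta-game itself.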
A Additional Discussions and Details
A.1 Limitations One potential limitation of DyGFormer is that it ignores high-order relationships between nodes, since it learns solely from nodes' first-hop interactions. In scenarios where such high-order relationships are essential, DyGFormer may be suboptimal compared with baselines that learn higher-order interactions. However, trivially feeding the multi-hop neighbors of nodes into DyGFormer would incur expensive computational costs. It is therefore promising to design more efficient and effective frameworks for modeling nodes' high-order relationships in dynamic graph learning. Another potential limitation is the sensitivity of the neighbor co-occurrence encoding scheme to different negative sampling strategies (discussed in Section 5.7).
Towards a Unified Framework of Contrastive Learning for Disentangled Representations
Contrastive learning has recently emerged as a promising approach for learning data representations that discover and disentangle the explanatory factors of the data. Previous analyses of such approaches have largely focused on individual contrastive losses, such as noise-contrastive estimation (NCE) and InfoNCE, and have relied on specific assumptions about the data-generating process. This paper extends the theoretical guarantees for disentanglement to a broader family of contrastive methods, while also relaxing the assumptions about the data distribution. Specifically, we prove identifiability of the true latents for the four contrastive losses studied in this paper, without imposing common independence assumptions. The theoretical findings are validated on several benchmark datasets. Finally, practical limitations of these methods are also investigated.
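Since the abstract refers to InfoNCE-style objectives, the following is the standard InfoNCE loss over paired embeddings, included for concreteness as a representative member of the loss family analyzed; it is not the paper's specific theoretical setup, and the temperature `tau` is the usual hyperparameter rather than anything prescribed here.

```python
# Standard InfoNCE loss over paired (anchor, positive) embeddings (common formulation).
import torch
import torch.nn.functional as F


def info_nce(anchors, positives, tau=0.1):
    """anchors, positives: (batch, dim) embeddings of two views of the same samples."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / tau                 # (batch, batch) scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Diagonal entries are the positive pairs; all off-diagonal entries act as negatives.
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    x, y = torch.randn(32, 128), torch.randn(32, 128)
    print(info_nce(x, y).item())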