sigmoid
Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts
The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, the softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has recently been proposed as an alternative and has been demonstrated empirically to achieve superior performance. However, a rigorous examination of the sigmoid gating function is lacking in the current literature. In this paper, we verify theoretically that the sigmoid gating, in fact, enjoys a higher sample efficiency than the softmax gating for the statistical task of expert estimation. Towards that goal, we consider a regression framework in which the unknown regression function is modeled as a mixture of experts, and study the rates of convergence of the least squares estimator in the over-specified case, in which the number of fitted experts is larger than the true value. We show that two gating regimes naturally arise and, in each of them, we formulate an identifiability condition for the expert functions and derive the corresponding convergence rates. In both cases, we find that experts formulated as feed-forward networks with commonly used activations such as $\mathrm{ReLU}$ and $\mathrm{GELU}$ enjoy faster convergence rates under the sigmoid gating than under the softmax gating. Furthermore, given the same choice of experts, we demonstrate that the sigmoid gating function requires a smaller sample size than its softmax counterpart to attain the same error of expert estimation and, therefore, is more sample efficient.
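To make the contrast between the two gating functions concrete, the following is a minimal sketch of a mixture-of-experts regression function under each gate. It is an illustration only, not the parameterization analyzed in the paper: the affine gating scores, the one-neuron ReLU experts, and every name in the code (`softmax_gate`, `sigmoid_gate`, `relu_expert`, `moe_regression`) are assumptions made for this example.

```python
import math


def softmax_gate(scores):
    """Softmax weights: positive and summing to one, so experts compete."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_gate(scores):
    """Sigmoid weights: each lies in (0, 1) and depends only on its own score."""
    weights = []
    for s in scores:
        if s >= 0:
            weights.append(1.0 / (1.0 + math.exp(-s)))
        else:
            # Equivalent form that avoids overflow for very negative scores.
            e = math.exp(s)
            weights.append(e / (1.0 + e))
    return weights


def relu_expert(x, a, b):
    """Hypothetical one-neuron expert: ReLU(a . x + b)."""
    return max(0.0, sum(ai * xi for ai, xi in zip(a, x)) + b)


def moe_regression(x, gate, params):
    """Regression function sum_i gate_i(x) * expert_i(x).

    params is a list of (beta1, beta0, a, b) tuples, one per expert, where
    beta1 . x + beta0 is the (assumed) affine gating score of that expert.
    """
    scores = [sum(w * xi for w, xi in zip(beta1, x)) + beta0
              for beta1, beta0, _, _ in params]
    weights = gate(scores)
    return sum(w * relu_expert(x, a, b)
               for w, (_, _, a, b) in zip(weights, params))
```

Under the sigmoid gate, raising one expert's score leaves the weights of the other experts unchanged, whereas under the softmax gate it necessarily lowers them; this is the competition among experts that the abstract refers to.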
Here, we describe the detailed realization of the Line-Search & Momentum-PGD (LM-PGD) method. Compared with the commonly used PGD method of the following form δ
Our PMs are continuous and path-independent, overcoming the deficiency of previous works [47]. Nevertheless, there is still room for improvement in our approach and related works. This paper mainly focuses on adversarial robustness against white-box attacks generated by first-order gradient-based methods. When employing our MAIL in real-world applications, it may lead to over-confidence regarding many other attacks, e.g., provable attacks [5], black-box attacks [6], and physical attacks [25]. For data assigned larger weights, the resulting model would be more robust when it encounters similar data during testing. This unfairness problem seems inevitable for a reweighted learning framework, and it will be a subject of our further study.