Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts
The softmax gating function is arguably the most popular choice in mixture-of-experts modeling. Despite its widespread use in practice, softmax gating may induce unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has recently been proposed as an alternative and has been demonstrated empirically to achieve superior performance. However, a rigorous examination of the sigmoid gating function is lacking in the current literature. In this paper, we verify theoretically that sigmoid gating, in fact, enjoys higher sample efficiency than softmax gating for the statistical task of expert estimation. Towards that goal, we consider a regression framework in which the unknown regression function is modeled as a mixture of experts, and study the rates of convergence of the least squares estimator in the over-specified case, where the number of fitted experts is larger than the true number. We show that two gating regimes naturally arise and, in each of them, we formulate an identifiability condition for the expert functions and derive the corresponding convergence rates. In both cases, we find that experts formulated as feed-forward networks with commonly used activations such as $\mathrm{ReLU}$ and $\mathrm{GELU}$ enjoy faster convergence rates under sigmoid gating than under softmax gating. Furthermore, given the same choice of experts, we demonstrate that the sigmoid gating function requires a smaller sample size than its softmax counterpart to attain the same expert-estimation error and, therefore, is more sample efficient.
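The structural difference between the two gating functions can be illustrated with a minimal sketch. This is not the paper's estimator; all weights below are random placeholders, and a single linear-plus-ReLU map stands in for each feed-forward expert. The point it shows is that softmax weights are normalized to sum to one (experts compete), while sigmoid weights are computed independently per expert.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3                       # input dimension, number of experts
W_gate = rng.normal(size=(k, d))  # gating parameters (placeholders)
b_gate = rng.normal(size=k)
W_exp = rng.normal(size=(k, d))   # one ReLU expert per row (placeholders)
b_exp = rng.normal(size=k)

def moe_output(x, gating="softmax"):
    """Mixture-of-experts output under softmax or sigmoid gating."""
    logits = W_gate @ x + b_gate
    if gating == "softmax":
        # Softmax gating: weights are coupled and sum to 1,
        # so experts compete for mass.
        g = np.exp(logits - logits.max())
        g = g / g.sum()
    else:
        # Sigmoid gating: each expert's weight is an independent
        # value in (0, 1); no normalization across experts.
        g = 1.0 / (1.0 + np.exp(-logits))
    experts = np.maximum(W_exp @ x + b_exp, 0.0)  # ReLU experts
    return float(g @ experts)

x = rng.normal(size=d)
print(moe_output(x, "softmax"), moe_output(x, "sigmoid"))
```

Under softmax gating, increasing one expert's weight necessarily decreases the others'; sigmoid gating removes that coupling, which is the informal intuition behind the competition the abstract describes.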
On the Expressive Power of Deep Polynomial Neural Networks
We thank the reviewers for their positive and useful comments.

Points "close" to the variety: nearness could be formalized by considering tubular neighborhoods of the variety.

"See our detailed answer above." In addition to our general answer above, we believe that the algebraic framework may be

Specifically, one could consider "empirical filling", i.e., whether

"I suggest the authors emphasize this aspect in the paper." Our computations are based on finite

The focus of this paper was on expressivity, and we only briefly touched upon optimization/learning in Sec. 2.3.
To Reviewer1: 1. Method simplistic; places too many constraints on the activation (only ReLU-like activations)
We believe the proposed H-regularization is novel and by no means simplistic; it is well suited for one-class learning. ReLU-like activations are widely used, e.g., in Transformers, ResNets, etc., so this requirement does not limit the applicability of our method. In our experiments, we followed the baselines and used the same datasets as they did.
Sample Complexity of Algorithm Selection Using Neural Networks and Its Applications to Branch-and-Cut
We then apply this approach to the problem of making good decisions in the branch-and-cut framework for mixed-integer optimization (e.g., which cut to add?). In other words, the neural network will take as input a mixed-integer optimization instance and output a decision that will result in a small branch-and-cut tree for that instance.
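The instance-to-decision mapping described above can be sketched as a small scoring network. This is a hypothetical illustration, not the paper's construction: the feature encodings, network shapes, and weights are all placeholders. A feed-forward network scores each candidate cut given instance-level features, and the highest-scoring cut is selected.

```python
import numpy as np

rng = np.random.default_rng(1)

def score_cuts(instance_features, cut_features, W1, W2):
    """Score each candidate cut for a given (encoded) MIP instance.

    Each score comes from a one-hidden-layer ReLU network applied to the
    concatenation of instance features and one cut's features.
    """
    scores = []
    for c in cut_features:
        h = np.maximum(W1 @ np.concatenate([instance_features, c]), 0.0)
        scores.append(float(W2 @ h))
    return np.array(scores)

inst = rng.normal(size=5)        # placeholder encoding of a MIP instance
cuts = rng.normal(size=(4, 3))   # 4 candidate cuts, 3 features each
W1 = rng.normal(size=(8, 8))     # hidden layer (placeholder weights)
W2 = rng.normal(size=8)          # output layer (placeholder weights)

best = int(np.argmax(score_cuts(inst, cuts, W1, W2)))
print("selected cut index:", best)
```

In the learning setting the abstract describes, the weights would be trained so that the selected cut tends to yield a small branch-and-cut tree for the input instance.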