Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts

Neural Information Processing Systems

The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has recently been proposed as an alternative and has been demonstrated empirically to achieve superior performance. However, a rigorous examination of the sigmoid gating function is lacking in the current literature. In this paper, we verify theoretically that sigmoid gating, in fact, enjoys higher sample efficiency than softmax gating for the statistical task of expert estimation. Towards that goal, we consider a regression framework in which the unknown regression function is modeled as a mixture of experts, and study the rates of convergence of the least squares estimator in the over-specified case, where the number of fitted experts is larger than the true value. We show that two gating regimes naturally arise and, in each of them, we formulate an identifiability condition for the expert functions and derive the corresponding convergence rates. In both cases, we find that experts formulated as feed-forward networks with commonly used activation functions such as $\mathrm{ReLU}$ and $\mathrm{GELU}$ enjoy faster convergence rates under sigmoid gating than under softmax gating. Furthermore, given the same choice of experts, we demonstrate that the sigmoid gating function requires a smaller sample size than its softmax counterpart to attain the same expert-estimation error and is therefore more sample efficient.
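The structural difference between the two gating functions is easy to see numerically: softmax normalizes the gating scores jointly, so the expert weights are coupled and sum to one, whereas sigmoid maps each expert's score independently into (0, 1). The sketch below illustrates this contrast on made-up linear gating scores; the parameters and the two-expert setup are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_gate(x, w):
    # Softmax gating: scores are normalized jointly across experts,
    # so each row of weights sums to 1 and experts compete directly.
    s = x[:, None] * w[None, :]                  # gating scores, shape (n, k)
    e = np.exp(s - s.max(axis=1, keepdims=True)) # stabilized exponentials
    return e / e.sum(axis=1, keepdims=True)

def sigmoid_gate(x, w):
    # Sigmoid gating: each expert's weight is computed independently
    # in (0, 1), removing the built-in competition among experts.
    s = x[:, None] * w[None, :]
    return 1.0 / (1.0 + np.exp(-s))

x = rng.normal(size=5)               # 5 scalar inputs
w = np.array([1.0, -0.5])            # hypothetical gating parameters, 2 experts

print(softmax_gate(x, w).sum(axis=1))  # each row sums to 1
print(sigmoid_gate(x, w).sum(axis=1))  # rows need not sum to 1
```

Because the sigmoid weights are uncoupled, increasing one expert's gating score leaves the others' weights unchanged, which is the mechanism the abstract credits with avoiding unnecessary competition.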





On the Expressive Power of Deep Polynomial Neural Networks

Neural Information Processing Systems

We thank the reviewers for their positive and useful comments. Regarding inputs "close" to the variety: nearness could be formalized by considering tubular neighborhoods of the variety. In addition to our general answer above, we believe the algebraic framework may be extended; specifically, one could consider "empirical filling". Regarding the comment "I suggest the authors emphasize this aspect in the paper": the focus of this paper was on expressivity, and we only briefly touched upon optimization/learning in Sec. 2.3.





To Reviewer1: 1. Method simplistic, places too many constraints on the activation (only ReLU-like activations)

Neural Information Processing Systems

We believe the proposed H-regularization is novel and by no means simplistic; it is well suited for one-class learning. ReLU-like activations are widely used, e.g., in Transformers and ResNets, so this requirement does not limit the applicability of our method. In our experiments, we followed the baselines and used the same datasets as they did.


Sample Complexity of Algorithm Selection Using Neural Networks and Its Applications to Branch-and-Cut

Neural Information Processing Systems

We then apply this approach to the problem of making good decisions in the branch-and-cut framework for mixed-integer optimization (e.g., which cut to add?). In other words, the neural network will take as input a mixed-integer optimization instance and output a decision that will result in a small branch-and-cut tree for that instance.
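The decision rule described above can be sketched as a learned scoring function: a small network maps features of an instance/cut pair to a scalar score, and the highest-scoring candidate cut is added. Everything below (the feature dimensions, the two-layer ReLU architecture, and the random weights) is an illustrative assumption, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)

def score_cuts(features, W1, b1, W2, b2):
    # Two-layer ReLU network producing one scalar score per candidate cut.
    h = np.maximum(features @ W1 + b1, 0.0)  # hidden layer, shape (n_cuts, hidden)
    return h @ W2 + b2                       # scores, shape (n_cuts,)

d, hidden, n_cuts = 8, 16, 5                 # hypothetical sizes
W1 = rng.normal(size=(d, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(size=hidden);      b2 = 0.0

# One feature row per candidate cut (would encode the instance and the cut).
cut_features = rng.normal(size=(n_cuts, d))
scores = score_cuts(cut_features, W1, b1, W2, b2)
chosen = int(np.argmax(scores))              # cut predicted to yield a small tree
```

In practice the network's parameters would be trained so that the chosen cut minimizes the resulting branch-and-cut tree size; the sketch only shows the selection mechanics.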