DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets: Supplementary Material
Neural Information Processing Systems
Here we provide theoretical evidence that a vanilla MoE does not guarantee convergence when mixing multiple datasets. Consider a binary classification problem over $P$-patch inputs, where each patch has $d$ dimensions and the label is $y \in \{\pm 1\}$. A labeled data point $(x, y)$ thus has input $x = (x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(P)}) \in (\mathbb{R}^d)^P$, a collection of $P$ patch inputs, with $y$ as the data label. The data $x$ is generated from $K$ clusters. Chen et al. [2022] prove that in such a binary-classification problem, an MoE layer converges to an $o(1)$ test loss and zero training loss.
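To make the shapes in this setup concrete, the following is a minimal NumPy sketch of a $P$-patch, $K$-cluster binary dataset. The function name `sample_patch_dataset`, the Gaussian cluster centers, the noise scale, and the choice to place the signal in the first patch are all illustrative assumptions; they are not the generative process used in this work or in Chen et al. [2022].

```python
import numpy as np

def sample_patch_dataset(n, P, d, K, seed=0):
    """Draw n toy points, each with P patches of dimension d, from K clusters.

    Hypothetical generator: the cluster centers, the noise scale, and the
    single-signal-patch layout are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    centers = rng.standard_normal((K, d))      # one assumed center per cluster
    cluster = rng.integers(0, K, size=n)       # cluster index of each point
    y = rng.choice([-1, 1], size=n)            # binary label in {-1, +1}
    x = 0.1 * rng.standard_normal((n, P, d))   # background noise in every patch
    # Add a label-signed copy of the point's cluster center to patch 0.
    x[:, 0, :] += y[:, None] * centers[cluster]
    return x, y, cluster

x, y, cluster = sample_patch_dataset(n=8, P=4, d=16, K=3)
print(x.shape, y.shape, cluster.shape)  # (8, 4, 16) (8,) (8,)
```

Each row of `x` is one input in $(\mathbb{R}^d)^P$, `y` holds the labels in $\{\pm 1\}$, and `cluster` records which of the $K$ clusters produced the point.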
Apr-30-2026, 00:06:07 GMT