Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets

Neural Information Processing Systems 

Ensembles of machine learning models yield improved system performance as well as robust and interpretable uncertainty estimates; however, their inference costs can be prohibitively high. Ensemble Distribution Distillation (EnD²) is an approach that allows a single model to efficiently capture both the predictive performance and uncertainty estimates of an ensemble. For classification, this is achieved by training a Dirichlet distribution over the ensemble members' output distributions via the maximum-likelihood criterion. Although this criterion is theoretically principled, we show that it exhibits poor convergence when applied to large-scale tasks where the number of classes is very high. Specifically, under the Dirichlet log-likelihood criterion, classes with low probability induce larger gradients than high-probability classes.
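
To make the criterion and the gradient imbalance concrete, the following minimal PyTorch sketch computes the Dirichlet negative log-likelihood of ensemble members' output distributions and inspects the resulting gradients with respect to the concentration parameters. This is an illustrative sketch, not the paper's implementation; the names `dirichlet_nll`, `alpha`, and `ensemble_probs` are assumptions introduced here.

```python
import torch

def dirichlet_nll(alpha: torch.Tensor, ensemble_probs: torch.Tensor) -> torch.Tensor:
    """Mean negative log-likelihood of M ensemble outputs under Dir(alpha).

    alpha:          (B, K) predicted concentration parameters, all > 0
    ensemble_probs: (B, M, K) categorical outputs of the M ensemble members

    log Dir(pi | alpha) = lgamma(sum_c alpha_c) - sum_c lgamma(alpha_c)
                          + sum_c (alpha_c - 1) * log pi_c
    """
    log_norm = torch.lgamma(alpha.sum(-1)) - torch.lgamma(alpha).sum(-1)   # (B,)
    log_lik = ((alpha - 1).unsqueeze(1) * ensemble_probs.log()).sum(-1)    # (B, M)
    return -(log_norm + log_lik.mean(1)).mean()

# Toy demonstration of the gradient imbalance: with K = 1000 classes and
# peaked ensemble outputs, the many low-probability classes carry large
# -log(pi_c) terms and hence dominate the gradient w.r.t. alpha.
B, M, K = 1, 8, 1000
alpha = torch.ones(B, K, requires_grad=True)
probs = torch.softmax(5.0 * torch.randn(B, M, K), dim=-1)  # peaked predictions
dirichlet_nll(alpha, probs).backward()

top = probs.mean(1).argmax()
print("grad at the high-probability class:", alpha.grad[0, top].item())
print("mean |grad| over all classes:      ", alpha.grad[0].abs().mean().item())
```

Differentiating the log-likelihood shows why: the gradient with respect to alpha_c contains a log pi_c term, which is small in magnitude for the high-probability class but large and negative for each of the many tail classes, so the tail collectively dominates the update when K is large.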