Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets

Ryabinin, Max, Malinin, Andrey, Gales, Mark

May 14, 2021 - arXiv.org Artificial Intelligence

Ensembles of machine learning models yield improved system performance as well as robust and interpretable uncertainty estimates; however, their inference costs are often prohibitively high. Ensemble Distribution Distillation is an approach that allows a single model to efficiently capture both the predictive performance and the uncertainty estimates of an ensemble. For classification, this is achieved by training a Dirichlet distribution over the ensemble members' output distributions via the maximum likelihood criterion. Although theoretically principled, this criterion exhibits poor convergence when applied to large-scale tasks where the number of classes is very high. In this work, we analyze this effect and show that, for the Dirichlet log-likelihood criterion, classes with low probability induce larger gradients than high-probability classes. This forces the model to focus on the distribution of the ensemble's tail-class probabilities. We propose a new training objective that minimizes the reverse KL-divergence to a Proxy-Dirichlet target derived from the ensemble. This loss resolves the gradient issues of Ensemble Distribution Distillation, as we demonstrate both theoretically and empirically on the ImageNet and WMT17 En-De datasets, containing 1,000 and 40,000 classes, respectively.
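The two quantities the abstract refers to, the Dirichlet log-likelihood of an ensemble member's output distribution and a KL-divergence between two Dirichlet distributions, both have standard closed forms. The following is a minimal sketch of those textbook formulas using only the Python standard library; it is not the authors' implementation. All function names and the toy numbers are hypothetical, and the target concentrations are an arbitrary stand-in, since the abstract does not say how the Proxy-Dirichlet target is built. Placing the model's Dirichlet in the first argument of the KL is an assumption about what "reverse" means here.

```python
import math


def digamma(x):
    """Digamma for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_log_likelihood(alpha, pi):
    """log Dir(pi | alpha) for one categorical distribution pi (all entries > 0)."""
    alpha0 = sum(alpha)
    log_norm = math.lgamma(alpha0) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(p) for a, p in zip(alpha, pi))


def log_likelihood_grad(alpha, pi):
    """Gradient of the log-likelihood with respect to each concentration alpha_k."""
    psi0 = digamma(sum(alpha))
    return [psi0 - digamma(a) + math.log(p) for a, p in zip(alpha, pi)]


def dirichlet_kl(alpha, beta):
    """Closed-form KL( Dir(alpha) || Dir(beta) )."""
    alpha0, beta0 = sum(alpha), sum(beta)
    psi0 = digamma(alpha0)
    kl = math.lgamma(alpha0) - math.lgamma(beta0)
    kl += sum(math.lgamma(b) - math.lgamma(a) for a, b in zip(alpha, beta))
    kl += sum((a - b) * (digamma(a) - psi0) for a, b in zip(alpha, beta))
    return kl


# Toy inputs, made up for illustration only.
member_probs = [0.90, 0.09, 0.01]   # one ensemble member's prediction over 3 classes
model_alpha = [5.0, 1.0, 0.2]       # concentrations predicted by the distilled model
target_alpha = [9.0, 1.5, 1.1]      # stand-in target, NOT the paper's proxy construction

print(dirichlet_log_likelihood(model_alpha, member_probs))
print(log_likelihood_grad(model_alpha, member_probs))
print(dirichlet_kl(model_alpha, target_alpha))
```

Printing the per-class gradient for different choices of `member_probs` is one way to explore how the likelihood term reacts to small tail probabilities, since each component contains a `log(p)` term that grows in magnitude as `p` approaches zero.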

dirichlet distribution, ensemble, ensemble distribution distillation

Country:
- Asia > Russia (0.04)
- Europe
  - Germany > Berlin (0.04)
  - Belgium (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.14)
  - Russia > Central Federal District
    - Moscow Oblast > Moscow (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Machine Translation (0.70)
  - Representation & Reasoning > Uncertainty
    - Bayesian Inference (0.48)
  - Machine Learning
    - Statistical Learning (0.46)
    - Learning Graphical Models > Directed Networks
      - Bayesian Learning (0.66)
