hl-gaussian
Investigating the Histogram Loss in Regression
Imani, Ehsan, Luedemann, Kai, Scholnick-Hughes, Sam, Elelimy, Esraa, White, Martha
It is becoming increasingly common in regression to train neural networks that model the entire distribution even if only the mean is required for prediction. This additional modeling often comes with performance gain and the reasons behind the improvement are not fully known. This paper investigates a recent approach to regression, the Histogram Loss, which involves learning the conditional distribution of the target variable by minimizing the cross-entropy between a target distribution and a flexible histogram prediction. We design theoretical and empirical analyses to determine why and when this performance gain appears, and how different components of the loss contribute to it. Our results suggest that the benefits of learning distributions in this setup come from improvements in optimization rather than learning a better representation. We then demonstrate the viability of the Histogram Loss in common deep learning applications without a need for costly hyperparameter tuning.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
- Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
Improving Regression Performance with Distributional Losses
There is growing evidence that converting targets to soft targets in supervised learning can provide considerable gains in performance. Much of this work has considered classification, converting hard zero-one values to soft labels---such as by adding label noise, incorporating label ambiguity or using distillation. In parallel, there is some evidence from a regression setting in reinforcement learning that learning distributions can improve performance. In this work, we investigate the reasons for this improvement, in a regression setting. We introduce a novel distributional regression loss, and similarly find it significantly improves prediction accuracy. We investigate several common hypotheses, around reducing overfitting and improved representations. We instead find evidence for an alternative hypothesis: this loss is easier to optimize, with better behaved gradients, resulting in improved generalization. We provide theoretical support for this alternative hypothesis, by characterizing the norm of the gradients of this loss.