Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization
Kaidi Cao, Yining Chen, Junwei Lu, Nikos Arechiga, Adrien Gaidon, Tengyu Ma
In real-world machine learning applications, even well-curated training datasets have various types of heterogeneity. Two main types of heterogeneity are: (1) data imbalance: the input or label distribution often has a long-tailed density, and (2) heteroskedasticity: the labels given inputs have varying levels of uncertainty across subsets of the data, stemming from sources such as the intrinsic ambiguity of the data or annotation errors. Many deep learning algorithms have been proposed for imbalanced datasets (e.g., see [Wang et al., 2017, Cao et al., 2019, Cui et al., 2019, Liu et al., 2019] and the references therein). However, heteroskedasticity, a classical notion studied extensively in the statistical community [Pintore et al., 2006, Wang et al., 2013, Tibshirani et al., 2014], has so far been under-explored in deep learning. This paper focuses on addressing heteroskedasticity and its interaction with data imbalance in deep learning. Heteroskedasticity is often studied in regression analysis and refers to the property that the distribution of the error varies across inputs. In this work, we mostly focus on classification, though the developed technique also applies to regression. Here, heteroskedasticity reflects how the uncertainty in the conditional distribution p(y | x), or the entropy of y | x, varies as a function of x.
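
As a minimal illustration of this notion of heteroskedasticity in classification (a hypothetical sketch, not code from the paper), the snippet below builds a synthetic binary dataset in which the label-flip probability, and hence the entropy of y | x, grows with |x|, so some regions of the input space carry far noisier labels than others:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic inputs; the true class depends on the sign of x.
n = 10_000
x = rng.uniform(-3, 3, size=n)
clean_y = (x > 0).astype(int)

# Label-flip probability increases with |x|, so the entropy of y | x
# varies across the input space (heteroskedastic label noise):
# near x = 0 labels are almost deterministic, near |x| = 3 they
# approach a coin flip.
flip_prob = 0.5 * np.abs(x) / 3.0
flips = rng.random(n) < flip_prob
y = np.where(flips, 1 - clean_y, clean_y)

# Empirical flip rates in a low-uncertainty and a high-uncertainty region.
low_unc = np.abs(x) < 0.5
high_unc = np.abs(x) > 2.5
print("flip rate for |x| < 0.5:", (y[low_unc] != clean_y[low_unc]).mean())
print("flip rate for |x| > 2.5:", (y[high_unc] != clean_y[high_unc]).mean())

A model trained on y without accounting for this structure treats both regions identically, which is the kind of mismatch the paper's adaptive regularization is meant to address.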
Jun-28-2020