Our Evaluation Metric Needs an Update to Encourage Generalization
Swaroop Mishra, Anjana Arunkumar, Chris Bryan, Chitta Baral
–arXiv.org Artificial Intelligence
Models that surpass human performance on several popular benchmarks display significant degradation in performance on exposure to Out of Distribution (OOD) data. Recent research has shown that models overfit to spurious biases and 'hack' datasets, in lieu of learning generalizable features like humans. In order to stop the inflation in model performance - and thus overestimation in AI systems' capabilities - we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.

Several approaches have been proposed to address this issue at various levels: (i) Data - filtering of biases (Bras et al., 2020; Li & Vasconcelos, 2019; Li et al., 2018; Wang et al., 2018), quantifying data quality, controlling data quality, using active learning, and avoiding the creation of low quality data (Mishra et al., 2020; Nie et al., 2019; Gardner et al., 2020; Kaushik et al., 2019), and (ii) Model - utilizing prior knowledge of biases to train a naive model exploiting biases, and then subsequently training an ensemble of the naive …
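The abstract does not give the WOOD Score formula, but the core idea - an evaluation score that up-weights OOD performance so bias-exploiting models cannot inflate their results - can be illustrated with a minimal sketch. The weighting scheme below (a simple convex blend of in-distribution and OOD accuracy, with an assumed `ood_weight`) is a hypothetical illustration of the general technique, not the paper's actual metric.

```python
# Hypothetical sketch of an OOD-weighted evaluation score.
# NOTE: this blend and the ood_weight value are illustrative assumptions;
# the paper's actual WOOD Score definition is not given in this abstract.

def weighted_ood_score(iid_accuracy: float, ood_accuracy: float,
                       ood_weight: float = 0.7) -> float:
    """Blend in-distribution and OOD accuracy, up-weighting the OOD term.

    A model that 'hacks' dataset biases scores high on iid_accuracy but
    low on ood_accuracy, so this blend deflates its overall score.
    """
    assert 0.0 <= ood_weight <= 1.0
    return (1 - ood_weight) * iid_accuracy + ood_weight * ood_accuracy

# A bias-exploiting model: strong in-distribution, weak OOD.
hacker = weighted_ood_score(0.95, 0.55)   # 0.3*0.95 + 0.7*0.55 = 0.67
# A generalizing model: slightly weaker in-distribution, robust OOD.
general = weighted_ood_score(0.88, 0.84)  # 0.3*0.88 + 0.7*0.84 = 0.852
```

Under such a weighting, the generalizing model outscores the bias-exploiting one despite a lower in-distribution accuracy, which is the behavior the proposed metric aims to encourage.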
Jul-14-2020