Energy-based Automated Model Evaluation
Peng, Ru, Zou, Heming, Wang, Haobo, Zeng, Yawen, Huang, Zenan, Zhao, Junbo
Conventional evaluation protocols for machine learning models rely heavily on a labeled, i.i.d.-assumed test dataset, which is often unavailable in real-world applications. Automated Model Evaluation (AutoEval) offers an alternative to this traditional workflow by constructing a proximal prediction pipeline for test performance without access to ground-truth labels. Despite its recent successes, the AutoEval framework still suffers from an overconfidence issue and substantial storage and computational costs. In that regard, we propose a novel measure -- Meta-Distribution Energy (MDE) -- that makes the AutoEval framework both more efficient and more effective. The core of MDE is to establish a meta-distribution statistic over the information (energy) associated with individual samples, and then to offer a smoother representation enabled by energy-based learning. We further provide theoretical insights by connecting MDE with the classification loss. We present extensive experiments across modalities, datasets, and architectural backbones to validate MDE's effectiveness and its superiority over prior approaches. We also demonstrate MDE's versatility through its seamless integration with large-scale models and its easy adaptation to learning scenarios with noisy or imbalanced labels.

Model evaluation grows increasingly critical in research and practice alongside the tremendous advances in machine learning techniques. The standard protocol evaluates a model on a pre-split test set that is (i) fully labeled and (ii) drawn i.i.d. However, this conventional approach may fail in real-world scenarios, which often involve distribution shifts and the absence of ground-truth labels. Under such distribution shifts, the performance of a trained model may vary significantly (Quinonero-Candela et al., 2008; Koh et al., 2021b), rendering in-distribution accuracy a weak indicator of the model's generalization performance. Moreover, traditional cross-validation (Arlot & Celisse, 2010) and sample annotation are both laborious, making it impractical to split or label every test set in the wild.
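To make the idea concrete, the sketch below illustrates one plausible reading of an energy-based AutoEval score: per-sample energies are computed from classifier logits via the standard free-energy form used in the energy-based model literature, and then aggregated into a single dataset-level "meta-distribution" statistic. This is a minimal sketch under stated assumptions, not the paper's exact MDE definition; the temperature `T`, the Boltzmann-style aggregation, and the function names are illustrative choices.

```python
# Hedged sketch of an energy-based dataset-level score for label-free
# evaluation. Assumptions: energy E(x) = -T * logsumexp(f(x)/T) (standard
# free-energy form), and a Boltzmann distribution over per-sample energies
# as the "meta-distribution" aggregate. The paper's exact MDE may differ.
import numpy as np
from scipy.special import logsumexp


def sample_energy(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Per-sample free energy E(x) = -T * logsumexp(f(x) / T), one value per row."""
    return -T * logsumexp(logits / T, axis=1)


def mde_style_score(logits: np.ndarray, T: float = 1.0) -> float:
    """Aggregate per-sample energies into a single dataset-level statistic.

    Here we normalize the energies into a Boltzmann distribution over the
    test set and average the log-probabilities -- one simple way to form a
    smooth meta-distribution statistic from sample-wise energies.
    """
    energies = sample_energy(logits, T)                    # shape: (N,)
    log_p = -energies / T - logsumexp(-energies / T)       # normalized log-probs
    return float(np.mean(log_p))


# Usage: logits produced by a trained classifier on an unlabeled,
# possibly distribution-shifted test set (random numbers stand in here).
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(1000, 10))                  # N=1000 samples, 10 classes
print(f"Energy-based dataset score: {mde_style_score(fake_logits):.4f}")
```

In an AutoEval pipeline, such a dataset-level score would typically be correlated with (or regressed against) accuracy measured on labeled meta-sets, so that test performance on new unlabeled sets can be predicted from the score alone.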
arXiv.org Artificial Intelligence
Jan-24-2024