Trustworthy Evaluation of Generative AI Models
Generative models have achieved remarkable success across numerous applications, showcasing their versatility and effectiveness in domains such as image synthesis, natural language processing, and scientific discovery (Achiam et al. 2023; Goodfellow et al. 2014; Karras et al. 2020; Van Den Oord et al. 2016). While extensive research has focused on developing and refining generative models, comparatively less attention has been given to evaluating them. Evaluation is essential for quantifying the quality of a model's outputs and for identifying the best model among several candidates. Yet evaluating a generative model is significantly more challenging than evaluating a predictor or classifier. For prediction or classification, the model's output can be compared directly with the true label. In contrast, the quality of a generative model is determined by how closely the distribution of its generated data matches the distribution of the input data, rather than by the similarity between individual generated samples and input samples (also known as the reconstruction error).
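To make the distinction concrete, a distribution-level comparison can be sketched with a standard two-sample statistic such as the maximum mean discrepancy (MMD). The snippet below is a minimal illustration, not a method from this paper: the toy Gaussian data and function names are assumptions chosen for clarity, and a one-dimensional Gaussian kernel stands in for the kernels used in practice.

```python
import math
import random

def mmd_rbf(xs, ys, sigma=1.0):
    """Squared maximum mean discrepancy (MMD) with a Gaussian kernel.

    A two-sample statistic: it is close to zero when the two sample sets
    are drawn from the same distribution, and grows as they diverge.
    """
    def k(a, b):
        return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))

    def mean_kernel(us, vs):
        return sum(k(u, v) for u in us for v in vs) / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)

rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(300)]  # stand-in for the input data
good = [rng.gauss(0.0, 1.0) for _ in range(300)]  # generator matching the distribution
bad = [rng.gauss(3.0, 1.0) for _ in range(300)]   # generator with a shifted distribution

# The matched generator scores lower (better) at the distribution level,
# even though no individual sample is compared to a specific input point.
print(mmd_rbf(real, good) < mmd_rbf(real, bad))  # → True
```

Unlike a per-sample reconstruction error, this score never pairs a generated point with a particular input point; it only asks whether the two sample sets look like draws from the same distribution.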
January 31, 2025