HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models

Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li F. Fei-Fei, Michael Bernstein

Neural Information Processing Systems 

Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy, indirect proxies because they rely on heuristics or pretrained embeddings. Until now, however, direct human evaluation strategies have been ad hoc: neither standardized nor validated. Our work establishes a gold-standard human benchmark for generative realism.