Evaluating model performance under worst-case subpopulations