Evaluating Model Performance Under Worst-case Subpopulations