Calibrated and uncertain? Evaluating uncertainty estimates in binary classification models