Is this model reliable for everyone? Testing for strong calibration