Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift