Measuring the Robustness of Natural Language Processing Models to Domain Shifts