Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering