Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods
