Changing Answer Order Can Decrease MMLU Accuracy
Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, Megan Ung
– arXiv.org Artificial Intelligence
NLP model accuracy has been shown to be fairly brittle. For example, accuracy can drop when researchers apply input alterations based on paraphrasing (Gan and Ng, 2019), word order changes (Gauthier and Levy, 2019; Ribeiro et al., 2020; Sinha et al., 2021a, 2022; Allen-Zhu and Li, 2023a,b; Berglund et al., 2023; Golovneva et al., 2024; Kitouni et al., 2024), or other minor, largely meaning-preserving input variations or perturbations (Belinkov and Bisk, 2018; Ebrahimi et al., 2018; Jiang et al., 2020; Gao et al., 2021; Li et al., 2021; Sinha et al., 2021b; Moradi and Samwald, 2021; Papakipos and Bitton, 2022; Qian et al., 2022; Goodarzi et al., 2023; Sinha et al., 2023).

Human performance on multiple choice tests can likewise be affected, for example, when answers are presented in a different order during retest (Krosnick and Fabrigar, 1991; Tellinghuisen and Sulikowski, 2008; Lions et al., 2022). However, as models do not have the biological limitations of humans, we may expect them to exhibit less variation than humans, or possibly even none at all. Thus, we claim that a model should be robust to answer order changes: if it gets the correct answer to a question when the answer is labeled 'A', it should also always get the correct answer when it is labeled 'C'. Put another way, the model should select the same answer for each question, regardless of its label, for every possible answer ordering.
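The robustness criterion described above is straightforward to operationalize. Below is a minimal sketch (not the authors' code) of such a check for a single four-option question; `model_choose` is a hypothetical callable standing in for an actual model query that takes a prompt and returns the letter the model picks.

```python
from itertools import permutations
from typing import Callable, Sequence

LABELS = "ABCD"

def order_robustness(question: str,
                     options: Sequence[str],
                     correct: str,
                     model_choose: Callable[[str], str]) -> dict:
    """Re-ask one 4-option question under every answer ordering and check
    whether the model keeps selecting the same underlying answer."""
    picks = []  # underlying answer text the model chose under each ordering
    for perm in permutations(options):  # 4! = 24 orderings
        prompt = question + "\n" + "\n".join(
            f"{label}. {text}" for label, text in zip(LABELS, perm))
        label = model_choose(prompt)             # hypothetical model call, e.g. returns "B"
        picks.append(perm[LABELS.index(label)])  # map the letter back to its answer text
    return {
        "consistent": len(set(picks)) == 1,  # same answer under every labeling?
        "accuracy": sum(p == correct for p in picks) / len(picks),
    }
```

A model satisfying the robustness claim would return `consistent=True` on every question, so its accuracy would be unaffected by relabeling; the share of questions where it is inconsistent is one way to quantify the brittleness the excerpt describes.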
Jun-27-2024
- Country:
- Asia > Middle East
- UAE (0.14)
- North America
- Canada (0.28)
- United States > California (0.14)
- Genre:
- Research Report (0.82)
- Industry:
- Education (1.00)