Leveraging Large Language Models for Multiple Choice Question Answering
Joshua Robinson, Christopher Michael Rytting, David Wingate
arXiv.org Artificial Intelligence
While large language models (LLMs) like GPT-3 have achieved impressive results on multiple choice question answering (MCQA) tasks in zero-, one-, and few-shot settings, they generally lag behind the MCQA state of the art (SOTA). MCQA tasks have traditionally been presented to LLMs as cloze tasks: an LLM is conditioned on a question (without the associated answer options), and its chosen option is the one assigned the highest probability after normalization (for length, etc.). A more natural prompting approach is to present the question and answer options to the LLM jointly and have it output the symbol (e.g., "A") associated with its chosen answer option. This approach allows the model to explicitly compare answer options, reduces computational cost, and mitigates the effects of tokenization scheme and answer option representations on answer selection. For the natural approach to be effective, the LLM must be able to associate answer options with the symbols that represent them; the LLM needs what we term multiple choice symbol binding (MCSB) ability. This ability varies greatly by model. We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach across 20 diverse datasets and largely closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been previously underestimated.

Current SOTA methods on many MCQA tasks involve specialized models, extensive per-task engineering, and individualized tuning. What if one model could do just as well as each of these models does individually? This is part of a general vision for so-called foundation models (Bommasani et al., 2021). Foundation models include large pre-trained language models that have derived enough broad knowledge (spanning, for example, linguistic, factual, and commonsense knowledge (Liu et al., 2019; Amrami & Goldberg, 2018; Petroni et al., 2020; Bosselut et al.; Bouraoui et al.; Zuo et al., 2018; Bhagavatula et al., 2019)) to transfer from a simple language modelling objective to a huge array of natural language tasks. Interestingly, while LLMs have achieved SOTA results on many tasks, they generally fall short on MCQA. Why is this the case, given the general language modelling prowess suggested by the low cross-entropy loss they attain with all their parameters, data, and compute (Kaplan et al., 2020; Henighan et al., 2020; Hernandez et al., 2021)?
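To make the contrast concrete, here is a minimal sketch of the two prompting strategies. It uses GPT-2 via Hugging Face transformers as a small stand-in for a large LLM; the prompt wording, the per-token length normalization, and the use of single-token answer symbols are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Small stand-in for a large LLM; any causal LM with this interface would do.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

question = "What is the capital of France?"
options = ["Berlin", "Paris", "Madrid", "Rome"]

def sequence_logprob(prompt: str, continuation: str) -> float:
    """Log-probability the model assigns to `continuation` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t+1, so drop the last position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_ids = full_ids[0, prompt_len:]
    token_lps = log_probs[prompt_len - 1 :].gather(1, cont_ids.unsqueeze(1))
    return token_lps.sum().item()

# Traditional cloze approach: the options never appear in the prompt; each
# is scored independently (one forward pass per option), length-normalized
# here by token count.
cloze_prompt = f"Question: {question}\nAnswer:"
scores = [
    sequence_logprob(cloze_prompt, " " + opt)
    / len(tokenizer(" " + opt).input_ids)
    for opt in options
]
cloze_choice = options[scores.index(max(scores))]

# Natural approach: show all options, then compare only the logits of the
# symbols "A".."D" at the next position (a single forward pass).
symbols = ["A", "B", "C", "D"]
body = "\n".join(f"{s}. {o}" for s, o in zip(symbols, options))
mcp_prompt = f"Question: {question}\n{body}\nAnswer:"
ids = tokenizer(mcp_prompt, return_tensors="pt").input_ids
with torch.no_grad():
    next_logits = model(ids).logits[0, -1]
# Assumes each " A"-style symbol is a single token (true for GPT-2's BPE).
symbol_ids = [tokenizer(" " + s).input_ids[0] for s in symbols]
mcp_choice = options[next_logits[symbol_ids].argmax().item()]

print("cloze choice:", cloze_choice, "| multiple choice prompting:", mcp_choice)
```

Note how the cloze approach needs one forward pass per answer option and its scores depend on how each option's text tokenizes, while the natural approach uses a single pass and compares only the symbol logits, which is what makes it cheaper and insensitive to option-text tokenization, provided the model can bind options to their symbols.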
Mar-16-2023