Leveraging Large Language Models for Multiple Choice Question Answering
Joshua Robinson, Christopher Michael Rytting, David Wingate
arXiv.org Artificial Intelligence
While large language models (LLMs) like GPT-3 have achieved impressive results on multiple choice question answering (MCQA) tasks in zero-, one-, and few-shot settings, they generally lag behind the MCQA state of the art (SOTA). MCQA tasks have traditionally been presented to LLMs as cloze tasks: an LLM is conditioned on a question (without the associated answer options), and its chosen option is the one assigned the highest probability after normalization (for length, etc.). A more natural prompting approach is to present the question and answer options to the LLM jointly and have it output the symbol (e.g., "A") associated with its chosen answer option. This approach allows the model to explicitly compare answer options, reduces computational cost, and mitigates the effects of tokenization scheme and answer option representations on answer selection. For the natural approach to be effective, the LLM must be able to associate answer options with the symbols that represent them; the LLM needs what we term multiple choice symbol binding (MCSB) ability. This ability varies greatly by model. We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach across 20 diverse datasets and largely closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been previously underestimated.

Current SOTA methods on many MCQA tasks involve specialized models, extensive per-task engineering, and individualized tuning. What if one model could do just as well as each of these models does individually? This is part of a general vision for so-called foundation models (Bommasani et al., 2021). Foundation models include large pre-trained language models that have derived enough broad knowledge (spanning, for example, linguistic, factual, and commonsense knowledge (Liu et al., 2019; Amrami & Goldberg, 2018; Petroni et al., 2020; Bosselut et al.; Bouraoui et al.; Zuo et al., 2018; Bhagavatula et al., 2019)) to transfer from a simple language modelling objective to a huge array of natural language tasks. Interestingly, while LLMs have achieved SOTA results on many tasks, they generally fall short on MCQA. Why is this the case, given the general language modelling prowess suggested by the low cross-entropy loss they attain with all their parameters, data, and compute (Kaplan et al., 2020; Henighan et al., 2020; Hernandez et al., 2021)?
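To make the contrast concrete, here is a minimal sketch of the two prompting strategies. It uses GPT-2 via Hugging Face transformers as a small stand-in for a large LLM; the prompt wording, the per-token length normalization, and the use of single-token answer symbols are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Small stand-in for a large LLM; any causal LM with this interface would do.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

question = "What is the capital of France?"
options = ["Berlin", "Paris", "Madrid", "Rome"]

def sequence_logprob(prompt: str, continuation: str) -> float:
    """Log-probability the model assigns to `continuation` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t+1, so drop the last position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_ids = full_ids[0, prompt_len:]
    token_lps = log_probs[prompt_len - 1 :].gather(1, cont_ids.unsqueeze(1))
    return token_lps.sum().item()

# Traditional cloze approach: the options never appear in the prompt; each
# is scored independently (one forward pass per option), length-normalized
# here by token count.
cloze_prompt = f"Question: {question}\nAnswer:"
scores = [
    sequence_logprob(cloze_prompt, " " + opt)
    / len(tokenizer(" " + opt).input_ids)
    for opt in options
]
cloze_choice = options[scores.index(max(scores))]

# Natural approach: show all options, then compare only the logits of the
# symbols "A".."D" at the next position (a single forward pass).
symbols = ["A", "B", "C", "D"]
body = "\n".join(f"{s}. {o}" for s, o in zip(symbols, options))
mcp_prompt = f"Question: {question}\n{body}\nAnswer:"
ids = tokenizer(mcp_prompt, return_tensors="pt").input_ids
with torch.no_grad():
    next_logits = model(ids).logits[0, -1]
# Assumes each " A"-style symbol is a single token (true for GPT-2's BPE).
symbol_ids = [tokenizer(" " + s).input_ids[0] for s in symbols]
mcp_choice = options[next_logits[symbol_ids].argmax().item()]

print("cloze choice:", cloze_choice, "| multiple choice prompting:", mcp_choice)
```

Note how the cloze approach needs one forward pass per answer option and its scores depend on how each option's text tokenizes, while the natural approach uses a single pass and compares only the symbol logits, which is what makes it cheaper and insensitive to option-text tokenization, provided the model can bind options to their symbols.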
Mar-16-2023