Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation

Vaishnavi Shrivastava, Percy Liang, Ananya Kumar

Nov-15-2023–arXiv.org Artificial Intelligence

To maintain user trust, large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user. The standard approach to estimating confidence is to use the softmax probabilities of these models, but as of November 2023, state-of-the-art LLMs such as GPT-4 and Claude-v1.3 do not provide access to these probabilities. We first study eliciting confidence linguistically -- asking an LLM for its confidence in its answer -- which performs reasonably (80.5% AUC on GPT-4 averaged across 12 question-answering datasets -- 7% above a random baseline) but leaves room for improvement. We then explore using a surrogate confidence model -- using a model for which we do have probabilities to evaluate the original model's confidence on a given question. Surprisingly, even though these probabilities come from a different and often weaker model, this method leads to higher AUC than linguistic confidences on 9 out of 12 datasets. Our best method, composing linguistic confidences and surrogate model probabilities, gives state-of-the-art confidence estimates on all 12 datasets (84.6% average AUC on GPT-4).

As large language models (LLMs) are increasingly deployed, it is important that they signal low confidence on examples where they are likely to make mistakes. This paper's goal is to produce good confidence estimates for state-of-the-art LLMs that do not provide model probabilities or representations (such as GPT-4 and Claude-v1.3). We first examine a natural idea of eliciting linguistic confidence scores (Tian et al., 2023; Lin et al., 2022; Xiong et al., 2023) -- prompting the LLM to assess its confidence in its answer (Figure 1, GPT-4 Linguistic). We find that linguistic confidences work reasonably well for state-of-the-art models, and much better than a random guessing baseline, but still leave room for improvement (Section 3). Averaged across the datasets, GPT-4 achieves a selective classification AUC of 80.5%, which is 7% above a random guessing baseline.
Our results hold across 12 standard datasets (8 MMLU datasets, TruthfulQA, CommonsenseQA, OpenbookQA, and MedQA), 5 models (GPT-4, Claude-v1.3,
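
The excerpt above does not spell out how the selective classification AUC is computed or how linguistic confidences and surrogate probabilities are composed. The following is a minimal sketch under two stated assumptions: that AUC means the area under the accuracy-coverage curve (answer only the top-k most confident examples and average the resulting accuracy over all k), and that the two signals are combined by simple linear interpolation. The function names and the weight `alpha` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def selective_auc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Area under the accuracy-coverage curve.

    Assumption: approximated as the mean accuracy over coverage levels
    1/n, 2/n, ..., 1, where the most confident examples are answered first.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    # sorted() is stable, so tied confidences keep their input order
    # (a simplification; tie handling affects the score for coarse confidences).
    order = sorted(range(n), key=lambda i: -confidences[i])
    hits = 0
    total = 0.0
    for k, i in enumerate(order, start=1):
        hits += int(correct[i])
        total += hits / k  # accuracy when only the top-k examples are answered
    return total / n


def mix_confidences(
    linguistic: Sequence[float], surrogate: Sequence[float], alpha: float = 0.5
) -> List[float]:
    """Hypothetical composition: linear interpolation of the two signals.

    `alpha` is an illustrative weight on the surrogate probability,
    not a value reported in the source.
    """
    if len(linguistic) != len(surrogate):
        raise ValueError("inputs must be of equal length")
    return [(1 - alpha) * l + alpha * s for l, s in zip(linguistic, surrogate)]
```

Under this definition, a confidence signal that carries no information about correctness scores, in expectation, about the model's plain accuracy, which is one way to read the random guessing baseline mentioned above. Linguistic confidences that take only a few distinct values produce many ties, so mixing in a finer-grained surrogate probability can change the ranking even with a small `alpha`.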

confidence score, linguistic confidence, probability

Country:
- North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre:
- Research Report
  - Promising Solution (0.54)
  - New Finding (0.34)

Industry:
- Information Technology > Security & Privacy (0.69)
- Education (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)