On Subjective Uncertainty Quantification and Calibration in Natural Language Generation
An example of this is question answering (QA): given a question from the user, the model may provide a brief answer, but it may also follow up with supporting facts and explanations, which can vary in form and detail. The user can be satisfied by a wide variety of responses, irrespective of their style or (to some extent) the choice of supporting facts included. Free-form NLG thus poses significant challenges to uncertainty quantification: some aspects of generation are irrelevant to the task's purpose and are best excluded from uncertainty quantification, yet they are often difficult to characterize precisely. If left unaddressed, however, the model's variation along these irrelevant aspects may dominate standard uncertainty measures such as token-level entropy (Kuhn et al., 2023), rendering them uninformative about the model's actual performance on the task.

A recent line of work, starting with Kuhn et al. (2023) and continued by Lin et al. (2024), Zhang et al. (2023), and Aichberger et al. (2024), studied this issue and proposed measuring the "semantic uncertainty" of generation, where the "semantics" of a response is defined as the equivalence class of textual responses that logically entail one another. Empirical improvements on downstream tasks demonstrated the value of these contributions and highlighted the importance of task-specific uncertainty quantification, but important conceptual and practical issues remain. From a practical perspective, semantic equivalence is estimated with machine learning models, resulting in imprecise estimates that do not necessarily define an equivalence relation.
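To make this concrete, the sketch below (not taken from the paper) illustrates one common instantiation of semantic uncertainty in the spirit of Kuhn et al. (2023): sampled responses are grouped by bidirectional entailment and entropy is computed over the resulting clusters rather than over individual sequences. The function names, the greedy single-representative clustering, and the toy `entails` predicate standing in for an NLI model are illustrative assumptions; the comments also note why the learned entailment judgments need not yield a true equivalence relation.

```python
import math
from collections import defaultdict
from typing import Callable, List


def cluster_by_bidirectional_entailment(
    responses: List[str],
    entails: Callable[[str, str], bool],
) -> List[int]:
    """Greedily group responses: a response joins the first cluster whose
    representative it mutually entails; otherwise it opens a new cluster."""
    representatives: List[str] = []
    assignments: List[int] = []
    for response in responses:
        for k, rep in enumerate(representatives):
            # Bidirectional entailment serves as a proxy for semantic
            # equivalence. Because a learned NLI judge need not be
            # symmetric or transitive, the grouping is not guaranteed to
            # be an equivalence relation and can depend on response order.
            if entails(response, rep) and entails(rep, response):
                assignments.append(k)
                break
        else:
            representatives.append(response)
            assignments.append(len(representatives) - 1)
    return assignments


def semantic_entropy(
    responses: List[str],
    log_probs: List[float],
    entails: Callable[[str, str], bool],
) -> float:
    """Entropy over semantic clusters: sequence probabilities of responses
    in the same cluster are pooled, so variation in wording alone does not
    inflate the uncertainty estimate."""
    assignments = cluster_by_bidirectional_entailment(responses, entails)
    cluster_mass = defaultdict(float)
    for k, lp in zip(assignments, log_probs):
        cluster_mass[k] += math.exp(lp)
    total = sum(cluster_mass.values())
    return -sum((m / total) * math.log(m / total) for m in cluster_mass.values())


if __name__ == "__main__":
    # Toy entailment judge (placeholder for an NLI model): exact match
    # after lowercasing and stripping punctuation.
    def toy_entails(premise: str, hypothesis: str) -> bool:
        def normalize(s: str) -> List[str]:
            return "".join(c for c in s.lower() if c.isalnum() or c.isspace()).split()
        return normalize(premise) == normalize(hypothesis)

    samples = ["Paris.", "paris", "Lyon."]
    log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]
    print(semantic_entropy(samples, log_probs, toy_entails))  # ~0.50 nats
```

In this toy run, the first two responses collapse into one semantic cluster with probability mass 0.8, so the cluster-level entropy is much lower than the entropy over the three raw sequences, which is exactly the effect the semantic-uncertainty line of work aims for.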