
Collaborating Authors

 Mueller, David


Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence

arXiv.org Artificial Intelligence

As large language models (LLMs) are increasingly used for factual question-answering, it becomes more important for LLMs to have the capability to communicate the likelihood that their answer is correct. For these verbalized expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence. However, when prompted to express confidence, the error rates of current LLMs are inconsistent with their communicated confidences, highlighting the need for uncertainty quantification methods. Many prior methods calculate lexical uncertainty, estimating a model's confidence in the specific string it generated. In some cases, however, it may be more useful to estimate semantic uncertainty, or the model's confidence in the answer regardless of how it is verbalized. We propose a simple procedure, uncertainty distillation, to teach an LLM to verbalize calibrated semantic confidences. Using held-out data to map initial uncertainty estimates to meaningful probabilities, we create examples annotated with verbalized probabilities for supervised fine-tuning. We demonstrate that our method yields verbalized confidences that correlate with observed error rates, both for a small fine-tuned language model and for larger instruction-tuned models, and we find that our semantic uncertainty correlates well with lexical uncertainty on short answers.
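As a hedged illustration of the calibration-mapping step described above, the sketch below bins held-out confidence scores by quantile, maps each bin to its empirical accuracy, and emits supervised fine-tuning targets with a verbalized probability. The binning scheme, the verbalization template, and the toy data are illustrative assumptions, not the authors' exact recipe.

# Minimal sketch of the calibration step, assuming we already have held-out
# (confidence, is_correct) pairs from some initial uncertainty estimator.
import numpy as np

def fit_confidence_bins(raw_conf, is_correct, n_bins=10):
    """Map raw confidence scores to empirical accuracy per quantile bin."""
    edges = np.quantile(raw_conf, np.linspace(0, 1, n_bins + 1))
    bin_ids = np.clip(np.searchsorted(edges, raw_conf, side="right") - 1, 0, n_bins - 1)
    bin_acc = np.array([
        is_correct[bin_ids == b].mean() if (bin_ids == b).any() else np.nan
        for b in range(n_bins)
    ])
    return edges, bin_acc

def verbalize(question, answer, raw_conf, edges, bin_acc):
    """Build a fine-tuning example whose target verbalizes a calibrated probability."""
    b = int(np.clip(np.searchsorted(edges, raw_conf, side="right") - 1, 0, len(bin_acc) - 1))
    p = bin_acc[b]
    return {"prompt": question, "target": f"{answer} (confidence: {round(p * 100)}%)"}

# Toy held-out data: accuracy tracks confidence, standing in for real estimates.
rng = np.random.default_rng(0)
conf = rng.uniform(size=1000)
correct = rng.uniform(size=1000) < conf
edges, bin_acc = fit_confidence_bins(conf, correct)
print(verbalize("Capital of France?", "Paris", 0.92, edges, bin_acc))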


Where does In-context Translation Happen in Large Language Models

arXiv.org Artificial Intelligence

Self-supervised large language models have demonstrated the ability to perform Machine Translation (MT) via in-context learning, but little is known about where the model performs the task with respect to prompt instructions and demonstration examples. In this work, we attempt to characterize the region where large language models transition from in-context learners to translation models. Prior work on in-context MT has focused on prompt-engineering, treating GPT models as black boxes by focusing on which examples to provide in-context (Moslem et al., 2023). Agrawal et al. (2022) apply similarity-based retrieval to select in-context examples, while Sia & Duh (2023) suggest a coherence-based approach. However, these works apply surface-level interventions, leaving the internal mechanism of MT in GPT models largely not understood.
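The question of where in the network the translation emerges invites a simple layer-wise probe. Below is a minimal sketch, assuming GPT-2 via Hugging Face transformers as a stand-in model, that decodes each layer's hidden state through the final LayerNorm and unembedding (a "logit lens") to see at which depth the target-language continuation first dominates. This is an illustrative probe under those assumptions, not necessarily the intervention used in the paper.

# Logit-lens probe: decode every intermediate layer's last hidden state.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "English: cheese -> French: fromage\nEnglish: water -> French:"
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

# out.hidden_states holds (n_layers + 1) tensors of shape [1, seq_len, d_model].
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))  # decode last position
    top = tok.decode(logits.argmax(-1))
    print(f"layer {layer:2d}: top next token = {top!r}")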


Do Text-to-Text Multi-Task Learners Suffer from Task Conflict?

arXiv.org Artificial Intelligence

Traditional multi-task learning architectures train a single model across multiple tasks through a shared encoder followed by task-specific decoders. Learning these models often requires specialized training algorithms that address task conflict in the shared parameter updates, which otherwise can lead to negative transfer. A new type of multi-task learning within NLP homogenizes multi-task architectures as a shared encoder and language model decoder, which does surprisingly well across a range of diverse tasks. Does this new architecture suffer from task conflicts that require specialized training algorithms? We study how certain factors in the shift towards text-to-text models affect multi-task conflict and negative transfer, finding that both directional conflict and transfer are surprisingly constant across architectures.
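To make "directional conflict" concrete, here is a minimal sketch that measures conflict as the cosine similarity between two tasks' gradients on the shared parameters; a negative value indicates conflicting updates. The toy shared encoder, task heads, and data are illustrative stand-ins, and the paper's exact architectures and metric may differ.

# Measure gradient conflict between two tasks on shared parameters.
import torch
import torch.nn as nn

shared = nn.Linear(16, 16)                    # stand-in for a shared encoder
heads = [nn.Linear(16, 4), nn.Linear(16, 4)]  # task-specific decoders

def task_grad(task_id, x, y):
    """Flattened gradient of one task's loss w.r.t. the shared parameters."""
    shared.zero_grad()
    heads[task_id].zero_grad()
    loss = nn.functional.cross_entropy(heads[task_id](shared(x)), y)
    loss.backward()
    return torch.cat([p.grad.flatten() for p in shared.parameters()])

x = torch.randn(8, 16)
y0, y1 = torch.randint(0, 4, (8,)), torch.randint(0, 4, (8,))
g0, g1 = task_grad(0, x, y0), task_grad(1, x, y1)

cos = nn.functional.cosine_similarity(g0, g1, dim=0)
print(f"gradient cosine similarity: {cos.item():.3f}  (negative => conflict)")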


Ensemble Distillation for Structured Prediction: Calibrated, Accurate, Fast—Choose Three

arXiv.org Machine Learning

Modern neural networks do not always produce well-calibrated predictions, even when trained with a proper scoring function such as cross-entropy. In classification settings, simple methods such as isotonic regression or temperature scaling may be used in conjunction with a held-out dataset to calibrate model outputs. However, extending these methods to structured prediction is not always straightforward or effective; furthermore, a held-out calibration set may not always be available. In this paper, we study ensemble distillation as a general framework for producing well-calibrated structured prediction models while avoiding the prohibitive inference-time cost of ensembles. We validate this framework on two tasks: named-entity recognition and machine translation. We find that, across both tasks, ensemble distillation produces models which retain much of, and occasionally improve upon, the performance and calibration benefits of ensembles, while requiring only a single model at test time.
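As a hedged sketch of the core objective, the code below distills an ensemble's averaged predictive distribution into a single student by minimizing the KL divergence to the soft ensemble targets. Toy linear classifiers stand in for trained structured-prediction models, and sequence-level details (NER tagging, MT decoding) are omitted.

# Distill an ensemble's averaged distribution into a single student model.
import torch
import torch.nn as nn
import torch.nn.functional as F

ensemble = [nn.Linear(32, 10) for _ in range(5)]  # stand-ins for trained members
student = nn.Linear(32, 10)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(64, 32)
    with torch.no_grad():
        # Teacher: average the members' probabilities (not their logits).
        teacher_probs = torch.stack([m(x).softmax(-1) for m in ensemble]).mean(0)
    # Student: KL divergence to the soft ensemble distribution,
    # i.e. cross-entropy against soft targets up to a constant.
    loss = F.kl_div(student(x).log_softmax(-1), teacher_probs, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final distillation loss: {loss.item():.4f}")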