Sampling Preferences Yields Simple Trustworthiness Scores

Steinle, Sean

arXiv.org Artificial Intelligence 

With the onset of large language models (LLMs), the performance of artificial intelligence (AI) models is becoming increasingly multi-dimensional. Accordingly, several large, multi-dimensional evaluation frameworks have been put forward to evaluate LLMs. Though these frameworks are much more realistic than previous attempts that used only a single score like accuracy, multi-dimensional evaluations can complicate decision-making since there is no obvious way to select an optimal model. This work introduces preference sampling, a method to extract a scalar trustworthiness score from multi-dimensional evaluation results by considering the many characteristics of model performance that users value. We show that preference sampling improves upon alternative aggregation methods using multi-dimensional trustworthiness evaluations of LLMs from TrustLLM and DecodingTrust. We find that preference sampling is consistently reductive, fully reducing the set of candidate models 100% of the time, whereas Pareto optimality never reduces the set by more than 50%. Likewise, preference sampling is consistently sensitive to user priors, allowing users to specify the relative weighting and confidence of their preferences, whereas averaging scores is insensitive to users' prior knowledge.

With the recent rapid scaling of AI models, our trust in AI is no longer proportional to any single measure of system performance. Because new types of AI like LLMs can perform many kinds of tasks, a new suite of metrics is replacing singular error metrics like accuracy, capturing aspects of model behavior such as hallucination, unsafe recommendations, and alignment. This follows from existing work suggesting that trustworthiness is a function of a set of characteristics like fairness, safety, privacy, and so on [1], [11], [18].
Though there is no consensus on the exact characteristics of trustworthiness, it is clear that the relative value of the characteristics is domain-specific [18], and there is already work on defining and quantifying these characteristics in the context of large language models [7], [12], [21].
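The abstract does not spell out the preference-sampling algorithm itself, but the description (a scalar score derived from multi-dimensional results, sensitive to user-specified relative weighting and confidence) is consistent with sampling preference weight vectors from a prior and tallying how often each model wins. The sketch below is an illustrative assumption, not the paper's method: it draws weights from a Dirichlet prior (whose concentration parameters stand in for the user's relative weighting and confidence), scores each model under each sampled preference, and reports the fraction of samples in which each model is optimal.

```python
import numpy as np

def preference_sampling(scores, alpha, n_samples=10_000, seed=0):
    """Hypothetical sketch of preference sampling.

    scores : (n_models, n_dims) array of per-characteristic evaluations
             (e.g. fairness, safety, privacy), higher is better.
    alpha  : Dirichlet concentration encoding the user's relative
             weighting and confidence over the characteristics.
    Returns a scalar trustworthiness score per model: the fraction of
    sampled preference vectors under which that model is optimal.
    """
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(alpha, size=n_samples)   # (n_samples, n_dims)
    utilities = weights @ scores.T                   # (n_samples, n_models)
    winners = utilities.argmax(axis=1)               # best model per sample
    counts = np.bincount(winners, minlength=scores.shape[0])
    return counts / n_samples

# Toy example: three models scored on three trustworthiness dimensions.
scores = np.array([
    [0.9, 0.5, 0.6],
    [0.6, 0.9, 0.7],
    [0.7, 0.7, 0.7],
])
# Prior weighting the second characteristic (say, safety) more heavily.
trust = preference_sampling(scores, alpha=[1.0, 2.0, 1.0])
```

Because the scores sum to one and are generically unequal, a single argmax always exists, which matches the abstract's claim that the method fully reduces the candidate set; simple averaging, by contrast, has no parameter through which a user's prior could enter.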
