
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Neural Information Processing Systems

Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance, where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and to leak private information from both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable to jailbreaking system or user prompts, potentially because GPT-4 follows the (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/.


Sampling Preferences Yields Simple Trustworthiness Scores

Steinle, Sean

arXiv.org Artificial Intelligence

With the onset of large language models (LLMs), the performance of artificial intelligence (AI) models is becoming increasingly multi-dimensional. Accordingly, several large, multi-dimensional evaluation frameworks have been put forward to evaluate LLMs. Though these frameworks are far more realistic than previous attempts that used a single score like accuracy, multi-dimensional evaluations can complicate decision-making, since there is no obvious way to select an optimal model. This work introduces preference sampling, a method to extract a scalar trustworthiness score from multi-dimensional evaluation results by considering the many characteristics of model performance that users value. We show that preference sampling improves upon alternative aggregation methods, using multi-dimensional trustworthiness evaluations of LLMs from TrustLLM and DecodingTrust. We find that preference sampling is consistently reductive, fully reducing the set of candidate models 100% of the time, whereas Pareto optimality never reduces the set by more than 50%. Likewise, preference sampling is consistently sensitive to user priors, allowing users to specify the relative weighting and confidence of their preferences, whereas averaging scores is insensitive to users' prior knowledge.

With the recent rapid scaling of AI models, our trust in AI is no longer proportional to any single measure of system performance. Because new types of AI like LLMs can perform many types of tasks, a new suite of metrics is replacing singular error metrics like accuracy, capturing aspects of model behavior like hallucination, unsafe recommendations, and alignment. This follows from existing work suggesting that trustworthiness is a function of a set of characteristics such as fairness, safety, and privacy [1], [11], [18]. Though there is no consensus on the exact characteristics of trustworthiness, it is clear that the relative value of the characteristics is domain-specific [18], and there is already work on defining and quantifying these characteristics in the context of large language models [7], [12], [21].
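The contrast the abstract draws, Pareto optimality versus a prior-weighted scalar score, can be sketched as follows. The model names, dimension scores, and the `preference_score` helper are illustrative assumptions for this sketch, not the paper's actual algorithm or data:

```python
import random

# Hypothetical per-dimension trustworthiness scores (higher is better);
# dimension names loosely follow TrustLLM/DecodingTrust-style categories.
scores = {
    "model_a": {"toxicity": 0.9, "privacy": 0.6, "fairness": 0.7},
    "model_b": {"toxicity": 0.7, "privacy": 0.9, "fairness": 0.6},
    "model_c": {"toxicity": 0.6, "privacy": 0.5, "fairness": 0.6},
}

def dominates(u, v):
    """True if u is at least as good as v on every dimension and strictly better on one."""
    return all(u[k] >= v[k] for k in u) and any(u[k] > v[k] for k in u)

def pareto_front(models):
    """Keep every model not dominated by another -- often barely reductive,
    since any model with one strong dimension survives."""
    return [m for m in models
            if not any(dominates(models[o], models[m]) for o in models if o != m)]

def preference_score(model, prefs, n_samples=10_000, seed=0):
    """Scalar score: expected weighted utility, with dimension weights sampled
    around user priors (a simplified stand-in for preference sampling)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Perturb the prior weights to reflect uncertainty in user preferences.
        w = {k: max(0.0, rng.gauss(mu, 0.1)) for k, mu in prefs.items()}
        z = sum(w.values()) or 1.0
        total += sum(w[k] / z * model[k] for k in model)
    return total / n_samples

prefs = {"toxicity": 0.5, "privacy": 0.3, "fairness": 0.2}  # user priors
front = pareto_front(scores)  # only the dominated model_c is eliminated
best = max(front, key=lambda m: preference_score(scores[m], prefs))
```

Here Pareto filtering drops only model_c (dominated by model_a), leaving two incomparable candidates, while the prior-weighted scalar score always yields a single winner, which mirrors the reductiveness comparison the abstract reports.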


ELAB: Extensive LLM Alignment Benchmark in Persian Language

Pourbahman, Zahra, Rajabi, Fatemeh, Sadeghi, Mohammadhossein, Ghahroodi, Omid, Bakhshaei, Somaye, Amini, Arash, Kazemi, Reza, Baghshah, Mahdieh Soleymani

arXiv.org Artificial Intelligence

This paper presents a comprehensive evaluation framework for aligning Persian Large Language Models (LLMs) with critical ethical dimensions, including safety, fairness, and social norms. It addresses gaps in existing LLM evaluation frameworks by adapting them to Persian linguistic and cultural contexts. The benchmark comprises three types of Persian-language data: (i) translated data, (ii) new synthetically generated data, and (iii) new naturally collected data. We translate Anthropic Red Teaming data, AdvBench, HarmBench, and DecodingTrust into Persian. Furthermore, we create ProhibiBench-fa, SafeBench-fa, FairBench-fa, and SocialBench-fa as new datasets to address harmful and prohibited content in indigenous culture. Moreover, we collect an extensive dataset, GuardBench-fa, to capture Persian cultural norms. By combining these datasets, our work establishes a unified framework for evaluating Persian LLMs, offering a new approach to culturally grounded alignment evaluation. A systematic evaluation of Persian LLMs is performed across three alignment aspects: safety (avoiding harmful content), fairness (mitigating biases), and social norms (adhering to culturally accepted behaviors). We present a publicly available leaderboard that benchmarks Persian LLMs with respect to safety, fairness, and social norms at: https://huggingface.co/spaces/MCILAB/LLM_Alignment_Evaluation.


Interview with Bo Li: A comprehensive assessment of trustworthiness in GPT models

AIHub

Bo Li and colleagues won an outstanding datasets and benchmark track award at NeurIPS 2023 for their work DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In this interview, Bo tells us about the research, the team's methodology, and key findings. We focus on assessing the safety and risks of foundation models. In particular, we provide the first comprehensive trustworthiness evaluation platform for large language models (LLMs). Given the wide adoption of LLMs, it is critical to understand their safety and risks in different scenarios before deploying them at scale in the real world.