Risk Assessment and Statistical Significance in the Age of Foundation Models

Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, Jerret Ross

arXiv.org Machine Learning 

Foundation models such as large language models (LLMs) have shown remarkable capabilities, redefining the field of artificial intelligence. At the same time, they present pressing and challenging socio-technical risks regarding the trustworthiness of their outputs and their alignment with human values and ethics [Bommasani et al., 2021]. Evaluating LLMs is therefore a multi-dimensional problem, where these risks are assessed across diverse tasks and domains [Chang et al., 2023]. In order to quantify these risks, Liang et al. [2022], Wang et al. [2023], and Huang et al. [2023] proposed benchmarks of automatic metrics for probing the trustworthiness of LLMs. These metrics include accuracy, robustness, fairness, and toxicity of the outputs, among others. Human evaluation benchmarks can be even more nuanced, and are often employed when tasks surpass the scope of standard metrics. Notable benchmarks based on human and automatic evaluations include, among others, Chatbot Arena [Zheng et al., 2023], HELM [Bommasani et al., 2023], MosaicML's Eval, the Open LLM Leaderboard [Wolf, 2023], and BIG-bench [Srivastava et al., 2022], each catering to specific evaluation areas such as chatbot performance, knowledge assessment, and domain-specific challenges. Traditional metrics, however, sometimes do not correlate well with human judgments.
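Since the gap between automatic metrics and human judgments is central here, the following minimal Python sketch (not taken from the paper; the metric scores and human ratings below are hypothetical placeholders) illustrates one common way such agreement is quantified, via Spearman rank correlation between a metric's per-output scores and human ratings of the same outputs.

```python
# Illustrative sketch: measuring agreement between an automatic metric and
# human judgments with Spearman rank correlation. All data are hypothetical.
from scipy.stats import spearmanr

# Hypothetical scores from an automatic metric (e.g., a toxicity or quality proxy)
metric_scores = [0.91, 0.45, 0.78, 0.30, 0.66, 0.52]

# Hypothetical human ratings of the same six model outputs (1-5 scale)
human_ratings = [5, 3, 4, 1, 4, 2]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
# A low or unstable rho suggests the metric does not track human judgment well.
```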
