The Science of Evaluating Foundation Models

Yuan, Jiayi, Zhang, Jiamu, Wen, Andrew, Hu, Xia

Feb-12-2025–arXiv.org Artificial Intelligence

The emergent phenomena of large foundation models have revolutionized natural language processing. However, evaluating these models presents significant challenges due to their size, capabilities, and deployment across diverse applications. Existing literature often focuses on individual aspects, such as benchmark performance or specific tasks, but fails to provide a cohesive process that integrates the nuances of diverse use cases with broader ethical and operational considerations. This work focuses on three key aspects: (1) Formalizing the Evaluation Process by providing a structured framework tailored to specific use-case contexts, (2) Offering Actionable Tools and Frameworks such as checklists and templates to ensure thorough, reproducible, and practical evaluations, and (3) Surveying Recent Work with a targeted review of advancements in LLM evaluation, emphasizing real-world applications.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Feb-12-2025

arXiv.org PDF

Add feedback

Country:
- Oceania > Australia (0.04)
- North America
  - United States
    - Texas > Travis County
      - Austin (0.04)
    - Michigan > Washtenaw County
      - Ann Arbor (0.04)
    - Florida > Miami-Dade County
      - Miami (0.04)
    - California > San Diego County
      - San Diego (0.04)
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
- Europe
  - Czechia > Prague (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - Taiwan (0.04)
  - Myanmar > Tanintharyi Region
    - Dawei (0.04)
  - Middle East
    - Jordan (0.04)
    - Yemen > Amran Governorate
      - Amran (0.04)

Genre:
- Research Report (1.00)
- Overview (1.00)
- Workflow (0.93)

Industry:
- Information Technology > Security & Privacy (0.93)
- Health & Medicine (0.68)
- Government (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.94)
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found