Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

Zhuang, Ziyu, Chen, Qiguang, Ma, Longxuan, Li, Mingda, Han, Yi, Qian, Yushan, Bai, Haopeng, Feng, Zixian, Zhang, Weinan, Liu, Ting

Aug-15-2023–arXiv.org Artificial Intelligence

From pre-trained language models (PLMs) to large language models (LLMs), the field of natural language processing (NLP) has witnessed steep performance gains and wide practical use. The evaluation of a research field guides its direction of improvement. However, LLMs are extremely hard to evaluate thoroughly, for two reasons. First, traditional NLP tasks have become inadequate because LLMs already perform so well on them. Second, existing evaluation tasks struggle to keep up with the wide range of real-world applications. To tackle these problems, existing works have proposed various benchmarks to better evaluate LLMs. To clarify the numerous evaluation tasks in both academia and industry, we survey multiple papers concerning LLM evaluation. We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety. For each competency, we introduce its definition, corresponding benchmarks, and metrics. Under this competency architecture, similar tasks are combined to reflect the corresponding ability, and new tasks can easily be added to the system. Finally, we give our suggestions on the future direction of LLM evaluation.

computational linguistic, large language model, machine learning



Genre:
- Overview (1.00)
- Research Report (0.63)

Industry:
- Education (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
