Through the Lens of Core Competency: Survey on Evaluation of Large Language Models
Zhuang, Ziyu, Chen, Qiguang, Ma, Longxuan, Li, Mingda, Han, Yi, Qian, Yushan, Bai, Haopeng, Feng, Zixian, Zhang, Weinan, Liu, Ting
–arXiv.org Artificial Intelligence
From pre-trained language models (PLMs) to large language models (LLMs), the field of natural language processing (NLP) has witnessed steep performance gains and wide practical use. The evaluation of a research field guides its direction of improvement. However, LLMs are extremely hard to evaluate thoroughly, for two reasons. First, traditional NLP tasks have become inadequate because LLMs already perform so well on them. Second, existing evaluation tasks struggle to keep up with the wide range of real-world applications. To tackle these problems, existing works have proposed various benchmarks to better evaluate LLMs. To clarify the numerous evaluation tasks in both academia and industry, we survey multiple papers concerning LLM evaluation. We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety. For each competency, we introduce its definition, corresponding benchmarks, and metrics. Under this competency architecture, similar tasks are combined to reflect the corresponding ability, and new tasks can easily be added to the system. Finally, we give our suggestions on the future direction of LLM evaluation.
Aug-15-2023
- Country:
- South America > Colombia
- Meta Department > Villavicencio (0.04)
- North America
- Dominican Republic (0.04)
- United States
- Maryland > Baltimore (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.28)
- Colorado > Denver County
- Denver (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Oregon > Multnomah County
- Portland (0.04)
- Washington > King County
- Seattle (0.28)
- California
- San Francisco County > San Francisco (0.14)
- Santa Clara County > Palo Alto (0.04)
- San Diego County > San Diego (0.04)
- New York > New York County
- New York City (0.04)
- Canada
- Quebec > Montreal (0.04)
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.14)
- Alberta > Census Division No. 15
- Improvement District No. 9 > Banff (0.04)
- Europe
- Austria (0.04)
- France (0.04)
- Spain > Valencian Community
- Valencia Province > Valencia (0.04)
- Italy > Tuscany
- Florence (0.04)
- United Kingdom > England
- Oxfordshire > Oxford (0.04)
- Cambridgeshire > Cambridge (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.05)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Germany > Saarland
- Saarbrücken (0.04)
- Asia
- Singapore (0.04)
- Middle East
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Qatar > Ad-Dawhah
- Doha (0.04)
- Israel > Tel Aviv District
- Tel Aviv (0.04)
- Japan > Kyūshū & Okinawa
- Okinawa (0.04)
- China
- Hong Kong (0.04)
- Heilongjiang Province > Harbin (0.04)
- Beijing > Beijing (0.04)
- Africa
- Rwanda > Kigali
- Kigali (0.04)
- Ethiopia > Addis Ababa
- Addis Ababa (0.04)
- Genre:
- Overview (1.00)
- Research Report (0.63)
- Industry:
- Education (0.68)
- Technology: