A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Laskar, Md Tahmid Rahman, Alqahtani, Sawsan, Bari, M Saiful, Rahman, Mizanur, Khan, Mohammad Abdullah Matin, Khan, Haidar, Jahan, Israt, Bhuiyan, Amran, Tan, Chee Wei, Parvez, Md Rizwan, Hoque, Enamul, Joty, Shafiq, Huang, Jimmy
arXiv.org Artificial Intelligence
Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, these models must be evaluated thoroughly before they are deployed in real-world applications to ensure that they perform reliably. Although the community widely recognizes the importance of evaluating LLMs, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations that cause these inconsistencies and make evaluations unreliable at various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.
Jul-4-2024
- Country:
- South America > Argentina (0.04)
- Oceania > Australia
- North America
- Dominican Republic (0.04)
- United States > California
- Santa Clara County > Palo Alto (0.04)
- Canada > Ontario
- Toronto (0.04)
- Europe
- United Kingdom (0.04)
- Russia (0.04)
- France (0.04)
- Switzerland > Basel-City
- Basel (0.04)
- Netherlands > North Holland
- Amsterdam (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Asia
- Singapore (0.04)
- Russia (0.04)
- Indonesia > Bali (0.04)
- Myanmar > Tanintharyi Region
- Dawei (0.04)
- Middle East
- Jordan (0.04)
- Saudi Arabia (0.04)
- Qatar (0.04)
- Yemen > Amran Governorate
- Amran (0.04)
- China
- Hong Kong (0.04)
- Guangxi Province > Nanning (0.04)
- Genre:
- Research Report (1.00)
- Overview (1.00)
- Industry:
- Information Technology (0.67)
- Leisure & Entertainment (0.47)
- Education (0.46)
- Banking & Finance (0.45)
- Technology: