AI Benchmarks and Datasets for LLM Evaluation
Ivanov, Todor, Penchev, Valeri
–arXiv.org Artificial Intelligence
LLMs demand significant computational resources for both pre-training and fine-tuning, requiring distributed computing capabilities due to their large model sizes \cite{sastry2024computing}. Their complex architecture poses challenges throughout the entire AI lifecycle, from data collection to deployment and monitoring \cite{OECD_AIlifecycle}. Addressing critical AI system challenges, such as explainability, corrigibility, interpretability, and hallucination, necessitates a systematic methodology and rigorous benchmarking \cite{guldimann2024complai}. To effectively improve AI systems, we must precisely identify systemic vulnerabilities through quantitative evaluation, bolstering system trustworthiness. The enactment of the EU AI Act \cite{EUAIAct} by the European Parliament on March 13, 2024, establishing the first comprehensive EU-wide requirements for the development, deployment, and use of AI systems, further underscores the importance of tools and methodologies such as Z-Inspection. It highlights the need to enrich this methodology with practical benchmarks to effectively address the technical challenges posed by AI systems. To this end, we have launched a project that is part of the AI Safety Bulgaria initiatives \cite{AI_Safety_Bulgaria}, aimed at collecting and categorizing AI benchmarks. This will enable practitioners to identify and utilize these benchmarks throughout the AI system lifecycle.
arXiv.org Artificial Intelligence
Dec-1-2024
- Country:
- North America
- United States
- New York > New York County
- New York City (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- California > San Francisco County
- San Francisco (0.14)
- New York > New York County
- Canada
- Ontario > Toronto (0.04)
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- United States
- Europe
- Asia
- Middle East > Jordan (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- North America
- Genre:
- Research Report (0.82)
- Industry:
- Information Technology > Security & Privacy (1.00)
- Government (0.93)
- Law (0.68)
- Health & Medicine > Therapeutic Area
- Education > Curriculum
- Subject-Specific Education (0.67)
- Technology: