ClimaQA: An Automated Evaluation Framework for Climate Foundation Models

Veeramakali Vignesh Manivannan, Yasaman Jafari, Srikar Eranky, Spencer Ho, Rose Yu, Duncan Watson-Parris, Yian Ma, Leon Bergen, Taylor Berg-Kirkpatrick

arXiv.org Artificial Intelligence

In recent years, foundation models have attracted significant interest in climate science due to their potential to transform how we approach critical challenges such as climate prediction and understanding the drivers of climate change [Thulke et al., 2024, Nguyen et al., 2024, Cao et al., 2024]. However, while these models are powerful, they often fall short when answering technical questions that require high precision, such as "What is the net effect of Arctic stratus clouds on the Arctic climate?" Even advanced models like GPT-4 exhibit epistemological inaccuracies in Climate Question-Answering (QA) tasks [Bulian et al., 2024], raising concerns about their reliability in scientific workflows. This highlights the need for a domain-specific evaluation framework to assess the quality and validity of the outputs these models generate. Current benchmarks for Large Language Models (LLMs) predominantly focus on linguistic accuracy or general factual correctness, but they fail to address the unique demands of climate science, where factual rigor, domain-specific knowledge, and robust reasoning are essential.