ClimaQA: An Automated Evaluation Framework for Climate Foundation Models

Veeramakali Vignesh Manivannan, Yasaman Jafari, Srikar Eranky, Spencer Ho, Rose Yu, Duncan Watson-Parris, Yian Ma, Leon Bergen, Taylor Berg-Kirkpatrick

arXiv.org Artificial Intelligence

In recent years, foundation models have attracted significant interest in climate science due to their potential to transform how we approach critical challenges such as climate prediction and understanding the drivers of climate change [Thulke et al., 2024, Nguyen et al., 2024, Cao et al., 2024]. However, while these models are powerful, they often fall short when answering technical questions that require high precision, such as "What is the net effect of Arctic stratus clouds on the Arctic climate?" Even advanced models like GPT-4 exhibit epistemological inaccuracies in Climate Question-Answering (QA) tasks [Bulian et al., 2024], raising concerns about their reliability in scientific workflows. This highlights the need for a domain-specific evaluation framework to assess the quality and validity of the outputs these models generate. Current benchmarks for Large Language Models (LLMs) predominantly focus on linguistic accuracy or general factual correctness, but they fail to address the unique demands of climate science, where factual rigor, domain-specific knowledge, and robust reasoning are essential.