CRASS: A Novel Data Set and Benchmark to Test Counterfactual Reasoning of Large Language Models

Oct-4-2022, arXiv.org Artificial Intelligence

We introduce the CRASS (counterfactual reasoning assessment) data set and benchmark, which use questionized counterfactual conditionals as a novel and powerful tool to evaluate large language models. We present the data set design and a benchmark that supports scoring against a crowd-validated human baseline. We test six state-of-the-art models against our benchmark. Our results show that the benchmark poses a valid challenge for these models and leaves considerable room for their improvement.

artificial intelligence, large language model, natural language

Country:
- North America
  - United States
    - New York > Tompkins County
      - Ithaca (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Massachusetts > Middlesex County
      - Cambridge (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
- Europe
  - Bulgaria (0.04)
  - Austria (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
    - Oxfordshire > Oxford (0.04)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - Netherlands > North Holland
    - Amsterdam (0.04)
  - Germany > Saxony
    - Leipzig (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Japan > Honshū
    - Chūbu > Toyama Prefecture > Toyama (0.04)

Genre:
- Research Report > New Finding (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Representation & Reasoning > Commonsense Reasoning (0.93)
