EvoGrad: A Dynamic Take on the Winograd Schema Challenge with Human Adversaries

Feb-22-2024–arXiv.org Artificial Intelligence

While Large Language Models (LLMs) excel at the Winograd Schema Challenge (WSC), a coreference resolution task testing common-sense reasoning through pronoun disambiguation, they struggle with instances that feature minor alterations or rewording. To address this, we introduce EvoGrad, an open-source platform that harnesses a human-in-the-loop approach to create a dynamic dataset tailored to such altered WSC instances. Leveraging ChatGPT's capabilities, we expand our task instances from 182 to 3,691, setting a new benchmark for diverse common-sense reasoning datasets. Additionally, we introduce the error depth metric, assessing model stability in dynamic tasks. Our results emphasize the challenge posed by EvoGrad: Even the best performing LLM, GPT-3.5, achieves an accuracy of 65.0% with an average error depth of 7.2, a stark contrast to human performance of 92.8% accuracy without perturbation errors. This highlights ongoing model limitations and the value of dynamic datasets in uncovering them.

computational linguistic, large language model, machine learning

Country:
- Asia
  - China > Hong Kong (0.04)
  - Japan > Honshū
    - Kansai > Osaka Prefecture > Osaka (0.04)
  - Middle East > Jordan (0.04)
- Europe
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Middle East > Malta
    - Port Region > Southern Harbour District > Valletta (0.04)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.14)
- North America
  - Canada > Quebec
    - Montreal (0.04)
  - Dominican Republic (0.04)
  - United States
    - District of Columbia > Washington (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.04)
    - New Mexico > Santa Fe County
      - Santa Fe (0.04)
    - New York > New York County
      - New York City (0.04)
- South America > Brazil
  - Rio Grande do Sul > Porto Alegre (0.04)

Genre:
- Overview (0.88)
- Research Report > New Finding (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Representation & Reasoning > Commonsense Reasoning (1.00)
