Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge

Park, Brendan, Janecek, Madeline, Ezzati-Jivan, Naser, Li, Yifeng, Emami, Ali

Jun-3-2024–arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable success in tasks like the Winograd Schema Challenge (WSC), showcasing advanced textual common-sense reasoning. However, applying this reasoning to multimodal domains, where understanding text and images together is essential, remains a substantial challenge. To address this, we introduce WinoVis, a novel dataset specifically designed to probe text-to-image models on pronoun disambiguation within multimodal contexts. Utilizing GPT-4 for prompt generation and Diffusion Attentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel evaluation framework that isolates the models' ability in pronoun disambiguation from other visual processing challenges. Evaluation of successive model versions reveals that, despite incremental advancements, Stable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, only marginally surpassing random guessing. Further error analysis identifies important areas for future research aimed at advancing text-to-image models in their ability to interpret and interact with the complex visual world.

computational linguistic, proceedings, pronoun, (12 more...)

arXiv.org Artificial Intelligence

Jun-3-2024

arXiv.org PDF

Add feedback

Country:
- North America
  - Dominican Republic (0.04)
  - United States
    - New York > New York County
      - New York City (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
  - Canada > Ontario
    - Toronto (0.04)
    - Niagara Region > St. Catharines (0.04)
- Europe
  - Monaco (0.04)
  - Ireland (0.04)
  - Italy
    - Tuscany > Florence (0.04)
    - Calabria > Catanzaro Province
      - Catanzaro (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning > Commonsense Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found