Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

May-27-2025, 21:55:33 GMT–Neural Information Processing Systems

Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn.
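The loop the abstract describes, where the model draws on a sketchpad and then reasons over what it has drawn, can be illustrated with a minimal sketch. This is not the authors' implementation: `Sketchpad`, `toy_model`, `run`, and the action tuples are hypothetical names chosen for illustration, the "model" is a hard-coded stand-in for a multimodal LM, and the rendered sketch is a list of strings rather than an image.

```python
from dataclasses import dataclass, field


@dataclass
class Sketchpad:
    """Hypothetical canvas that records the strokes drawn so far."""
    strokes: list = field(default_factory=list)

    def draw_line(self, start, end):
        # Record the stroke and return the updated view of the canvas.
        self.strokes.append(("line", start, end))
        return self.render()

    def render(self):
        # Stand-in for an image: one text description per stroke.
        return [f"{kind} {a}->{b}" for kind, a, b in self.strokes]


def toy_model(question, observation):
    """Stand-in for a multimodal LM (assumption): picks the next action.

    It draws one auxiliary line if the canvas is empty, then answers
    based on what it observes on the canvas.
    """
    if not observation:
        return ("draw_line", (0, 0), (4, 3))
    return ("answer", f"sketch has {len(observation)} stroke(s)")


def run(question, model, max_steps=5):
    """Alternate between model actions and sketchpad observations."""
    pad = Sketchpad()
    observation = pad.render()
    for _ in range(max_steps):
        action = model(question, observation)
        if action[0] == "answer":
            return action[1], pad
        if action[0] == "draw_line":
            # The drawn artifact becomes the model's next observation.
            observation = pad.draw_line(action[1], action[2])
    return None, pad  # no answer within the step budget


answer, pad = run("How long is the diagonal?", toy_model)
print(answer)
```

The point of the sketch is the control flow: each drawing action changes the canvas, and the updated canvas is fed back to the model before it plans its next step.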

multimodal language model, reasoning, sketchpad

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.80)
  - Vision (0.59)