Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
–Neural Information Processing Systems
Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn.
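The sketch-then-reason loop described above can be illustrated with a minimal toy, offered as a sketch under stated assumptions rather than the paper's implementation. Everything named here is hypothetical: the `Sketchpad` class is a text grid standing in for an image, `draw_line` is a made-up drawing tool, and `toy_model` is a hard-coded stand-in for a multimodal LM. The only idea taken from the abstract is that the model draws first and then reasons over what it has drawn.

```python
from typing import Callable, Optional


class Sketchpad:
    """Hypothetical canvas: a character grid standing in for an image."""

    def __init__(self, width: int, height: int) -> None:
        self.grid = [["." for _ in range(width)] for _ in range(height)]

    def draw_line(self, x0: int, y0: int, x1: int, y1: int, mark: str = "#") -> None:
        """Hypothetical tool: draw an axis-aligned segment on the grid."""
        if x0 == x1:
            for y in range(min(y0, y1), max(y0, y1) + 1):
                self.grid[y][x0] = mark
        elif y0 == y1:
            for x in range(min(x0, x1), max(x0, x1) + 1):
                self.grid[y0][x] = mark
        else:
            raise ValueError("this toy tool supports axis-aligned lines only")

    def render(self) -> str:
        """Return the current canvas as text, i.e. the visual artifact."""
        return "\n".join("".join(row) for row in self.grid)


def toy_model(question: str, observation: str) -> dict:
    """Stand-in for a multimodal LM (an assumption, not a real model).

    If nothing has been drawn yet, it asks for an auxiliary line;
    otherwise it answers by inspecting the sketch it produced.
    """
    if "#" not in observation:
        return {"action": "draw_line", "args": (2, 0, 2, 4)}
    marked = observation.count("#")
    return {"action": "answer", "text": f"auxiliary line covers {marked} cells"}


def run_sketchpad_loop(
    question: str,
    model: Callable[[str, str], dict],
    pad: Sketchpad,
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate between model decisions and drawing until an answer appears."""
    for _ in range(max_steps):
        step = model(question, pad.render())  # model sees the current sketch
        if step["action"] == "answer":
            return step["text"]
        if step["action"] == "draw_line":
            pad.draw_line(*step["args"])  # the sketch is updated for the next step
        else:
            raise ValueError(f"unknown action: {step['action']}")
    return None  # no answer within the step budget
```

Calling `run_sketchpad_loop("How long is the auxiliary line?", toy_model, Sketchpad(5, 5))` takes two steps: the first draws the line, and the second answers from the rendered canvas. A real system would replace the text grid with images and the stub with an actual multimodal LM.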
May-27-2025, 21:55:33 GMT
- Technology:
- Information Technology > Artificial Intelligence
- Natural Language (0.80)
- Vision (0.59)