Multimodal Tabular Reasoning with Privileged Structured Information
–Neural Information Processing Systems
Tabular reasoning requires complex, multi-step information extraction and logical inference, such as aggregation, comparison, or calculation over tabular data. While recent advances have leveraged large language models (LLMs) for reasoning over structured text tables, such high-quality textual representations are often unavailable in real-world settings, where tables typically appear as images. In this paper, we tackle the task of tabular reasoning directly from table images. Our core strategy is to leverage privileged structured information---specifically, the ground-truth structured table data available during training but inaccessible at test time---to enhance multimodal large language models (MLLMs). The key challenges lie in: accurately aligning visual representations with the structured information, particularly mapping the visual evidence to logical steps; and effectively transferring the reasoning skills learned during training to the MLLM for visual inference. To address these, we introduce {\sc Turbo} (TabUlar Reasoning with Bridged infOrmation), a new framework for multimodal tabular reasoning using privileged information.
Neural Information Processing Systems
Jun-13-2026, 11:07:07 GMT
- Technology: