TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
