TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models