Multi-modal Situated Reasoning in 3D Scenes