SPOT! Revisiting Video-Language Models for Event Understanding