Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language