Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos