MASTAF: A Model-Agnostic Spatio-Temporal Attention Fusion Network for Few-shot Video Classification