Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition