Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images