Improving Out-of-distribution Human Activity Recognition via IMU-Video Cross-modal Representation Learning