ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities