A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning