A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning
