OctoNet: A Large-Scale Multi-Modal Dataset for Human Activity Understanding Grounded in Motion-Captured 3D Pose Labels
–Neural Information Processing Systems
We introduce OctoNet, a large-scale, multi-modal, multi-view human activity dataset designed to advance human activity understanding and multi-modal learning. OctoNet comprises 12 heterogeneous modalities (including RGB, depth, and thermal cameras, infrared arrays, audio, millimeter-wave radar, Wi-Fi, IMU, and more) recorded from 41 participants under multi-view sensor setups, yielding over 67.72M synchronized frames. The data encompass 62 daily activities spanning categories such as structured routines, freestyle behaviors, human-environment interaction, and healthcare tasks. All modalities are annotated with high-fidelity 3D pose labels captured by a professional motion-capture system, allowing precise alignment and rich supervision across sensors and views. OctoNet is one of the most comprehensive datasets of its kind, enabling a wide range of learning tasks such as human activity recognition, 3D pose estimation, multi-modal fusion, cross-modal supervision, and sensor foundation models. Extensive experiments with a range of baselines demonstrate the dataset's sensing capacity. OctoNet offers a unique and unified testbed for developing and benchmarking generalizable, robust models for human-centric sensing AI.
Jun-15-2026, 00:51:14 GMT
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.46)
- Technology:
- Information Technology
- Communications > Networks (1.00)
- Artificial Intelligence
- Natural Language (0.93)
- Vision > Video Understanding (0.55)
- Machine Learning > Neural Networks (0.46)