The CASTLE 2024 Dataset: Advancing the State of the Art in Multimodal Understanding

Luca Rossetto, Werner Bailer, Duc-Tien Dang-Nguyen, Graham Healy, Björn Þór Jónsson, Onanong Kongmeesub, Hoang-Bao Le, Stevan Rudinac, Klaus Schöffmann, Florian Spiess, Allie Tran, Minh-Triet Tran, Quang-Linh Tran, Cathal Gurrin


Multi-perspective datasets that combine first-person and third-person views are rare and typically include only a limited number of activities and do not last long enough to capture the full range of interactions and social dynamics characteristic of everyday life. In this paper, we introduce the CASTLE 2024 dataset, a multimodal multi-perspective collection of ego-centric (first-person) and exo-centric (third-person) high-resolution video recordings, augmented with additional sensor streams, designed to capture the complexity of daily human experiences. The dataset captures the experience and daily interaction of ten volunteer participants over the course of four days. It shows a broad range of domestic and social activities, including cooking, eating, cleaning, meeting and leisure activities, capturing authentic interactions among participants.

Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such as blurred faces or distorted voices.
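The headline figure of over 600 hours is consistent with the stated recording setup. As a back-of-the-envelope check, a minimal sketch: the 15 time-aligned sources (10 egocentric participants plus 5 fixed cameras) over 4 days reach 600 hours if each source records roughly 10 hours per day; the abstract does not state the daily recording duration, so that figure, like the 30 Mbit/s bitrate used for the storage estimate, is an assumption for illustration only.

```python
# Plausibility check of the reported dataset size, using figures
# from the abstract: 15 time-aligned sources recording over 4 days.
SOURCES = 10 + 5          # egocentric participants + fixed exocentric cameras
DAYS = 4
HOURS_PER_DAY = 10        # ASSUMPTION: not stated in the abstract

total_hours = SOURCES * DAYS * HOURS_PER_DAY
print(f"Estimated total footage: {total_hours} hours")  # -> 600 hours

# Rough storage estimate for UHD (3840x2160) video at 50 fps.
# ASSUMPTION: ~30 Mbit/s average encoded bitrate, a typical ballpark
# for high-quality UHD; the paper may use a different codec/bitrate.
BITRATE_MBPS = 30
total_tb = total_hours * 3600 * BITRATE_MBPS / 8 / 1e6  # Mbit -> MB -> TB
print(f"Rough encoded size at {BITRATE_MBPS} Mbit/s: {total_tb:.1f} TB")
```

Under these assumptions the collection would occupy on the order of 8 TB, which illustrates why time-aligned multi-source UHD recording at this scale remains rare.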