
Collaborating Authors: Healy, Graham


The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding

arXiv.org Artificial Intelligence

Multi-perspective datasets that combine first-person and third-person views are rare and typically include only a limited number of activities, and do not last long enough to capture the full range of interactions and social dynamics characteristic of everyday life. In this paper, we introduce the CASTLE 2024 dataset, a multimodal multi-perspective collection of ego-centric (first-person) and exo-centric (third-person) high-resolution video recordings, augmented with additional sensor streams, designed to capture the complexity of daily human experiences. The dataset captures the experience and daily interaction of ten volunteer participants over the course of four days. It shows a broad range of domestic and social activities, including cooking, eating, cleaning, meeting and leisure activities, capturing authentic interactions among participants.

Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring.


Diffusing Surrogate Dreams of Video Scenes to Predict Video Memorability

arXiv.org Artificial Intelligence

As part of the MediaEval 2022 Predicting Video Memorability task we explore the relationship between visual memorability, the visual representation that characterises it, and the underlying concept portrayed by that visual representation. We achieve state-of-the-art memorability prediction performance with a model trained and tested exclusively on surrogate dream images, elevating concepts to the status of a cornerstone memorability feature, and finding strong evidence to suggest that the intrinsic memorability of visual content can be distilled to its underlying concept or meaning, irrespective of its specific visual representation.
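The abstract does not spell out how the surrogate dream images are produced. A minimal sketch of one plausible setup, assuming the images are rendered from video captions with an off-the-shelf text-to-image diffusion model (the checkpoint name, prompt, and downstream predictor here are illustrative assumptions, not the paper's exact configuration):

```python
# Hypothetical sketch: generate a "surrogate dream" image for a video caption
# with a text-to-image diffusion model, then hand it to any image-based
# memorability predictor. Checkpoint choice and prompt are assumptions.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)

def surrogate_dream(caption, seed=0):
    """Render one surrogate image for a clip's caption (deterministic per seed)."""
    generator = torch.Generator(device).manual_seed(seed)
    return pipe(caption, generator=generator).images[0]  # a PIL.Image

# The resulting image can be fed to the same feature extractor / regressor
# pipeline one would otherwise apply to real video frames.
image = surrogate_dream("a person rides a bicycle along a beach at sunset")
```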


Overview of The MediaEval 2022 Predicting Video Memorability Task

arXiv.org Artificial Intelligence

This paper describes the 5th edition of the Predicting Video Memorability Task as part of MediaEval 2022. This year we have reorganised and simplified the task in order to enable a greater depth of inquiry. Similar to last year, two datasets are provided in order to facilitate generalisation; however, this year we have replaced the TRECVid 2019 Video-to-Text dataset with the VideoMem dataset in order to remedy underlying data quality issues, and to prioritise short-term memorability prediction by elevating the Memento10k dataset as the primary dataset. Additionally, a fully fledged electroencephalography (EEG)-based prediction sub-task is introduced. In this paper, we outline the core facets of the task and its constituent sub-tasks, describing the datasets, evaluation metrics, and requirements for participant submissions.


Experiences from the MediaEval Predicting Media Memorability Task

arXiv.org Artificial Intelligence

The Predicting Media Memorability task in the MediaEval evaluation campaign has been running annually since 2018, and several different tasks and datasets have been used over that time. This has allowed us to compare the performance of many memorability prediction techniques on the same data and in a reproducible way, and to refine and improve on those techniques. The resources created to compute media memorability are now being used by researchers well beyond the actual evaluation campaign. In this paper we present a summary of the task, including the collective lessons we have learned for the research community.


Predicting Media Memorability: Comparing Visual, Textual and Auditory Features

arXiv.org Artificial Intelligence

This paper describes our approach to the Predicting Media Memorability task in MediaEval 2021, which aims to address the question of media memorability by setting the task of automatically predicting video memorability. This year we tackle the task from a comparative standpoint, looking to gain deeper insights into each of the three explored modalities, and using our results from last year's submission (2020) as a point of reference. Our best-performing short-term memorability model (0.132) tested on the TRECVid 2019 dataset -- just like last year -- was a frame-based CNN that was not trained on any TRECVid data, and our best short-term memorability model (0.524) tested on the Memento10k dataset was a Bayesian Ridge Regressor fit with DenseNet121 visual features.
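As a rough illustration of the Memento10k configuration mentioned above (a Bayesian ridge regressor fit on DenseNet121 visual features), the sketch below uses off-the-shelf components; the frame sampling, feature pooling, and the use of Spearman rank correlation for evaluation are assumptions about details the abstract does not state, not the authors' released code.

```python
# Minimal sketch: DenseNet121 as a fixed feature extractor, Bayesian ridge
# regression on pooled per-video features, Spearman rho for evaluation.
import numpy as np
import torch
from torchvision import models, transforms
from sklearn.linear_model import BayesianRidge
from scipy.stats import spearmanr

backbone = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
backbone.classifier = torch.nn.Identity()  # keep the 1024-d pooled features
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def densenet_features(pil_frames):
    """Average DenseNet121 features over a handful of frames from one video."""
    batch = torch.stack([preprocess(f) for f in pil_frames])
    return backbone(batch).mean(dim=0).numpy()

def fit_and_evaluate(X_train, y_train, X_test, y_test):
    """X: one pooled feature vector per video, y: memorability scores."""
    reg = BayesianRidge()
    reg.fit(X_train, y_train)
    rho, _ = spearmanr(reg.predict(X_test), y_test)
    return rho
```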


Overview of The MediaEval 2021 Predicting Media Memorability Task

arXiv.org Artificial Intelligence

This paper describes the MediaEval 2021 Predicting Media Memorability task, which is in its 4th edition this year, as the prediction of short-term and long-term video memorability remains a challenging task. In 2021, two datasets of videos are used: first, a subset of the TRECVid 2019 Video-to-Text dataset; second, the Memento10k dataset, in order to provide opportunities to explore cross-dataset generalisation. In addition, an Electroencephalography (EEG)-based prediction pilot subtask is introduced. In this paper, we outline the main aspects of the task and describe the datasets, evaluation metrics, and requirements for participants' submissions.


An Annotated Video Dataset for Computing Video Memorability

arXiv.org Artificial Intelligence

Using a collection of publicly available links to short-form video clips with an average duration of 6 seconds each, 1,275 users manually annotated each video multiple times to indicate both its long-term and short-term memorability. The annotations were gathered as part of an online memory game and measured a participant's ability to recall having seen the video previously when shown a collection of videos. The recognition tasks were performed on videos seen within the previous few minutes for short-term memorability and within the previous 24 to 72 hours for long-term memorability. The data includes the reaction times for each recognition of each video. Associated with each video are text descriptions (captions) as well as a collection of image-level features applied to 3 frames extracted from each video (start, middle and end). Video-level features are also provided. The dataset was used in the Video Memorability task as part of the MediaEval benchmark in 2020.
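A minimal sketch of the frame sampling described above (the start, middle, and end frames of each clip, from which image-level features are then computed), assuming OpenCV for decoding; the dataset itself ships precomputed features, so this is only an illustration of the sampling scheme:

```python
# Extract the first, middle, and last decodable frames of a video clip.
import cv2

def start_middle_end_frames(video_path):
    """Return up to three frames (start, middle, end) as BGR numpy arrays."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for index in (0, total // 2, max(total - 1, 0)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```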


Investigating Memorability of Dynamic Media

arXiv.org Artificial Intelligence

The Predicting Media Memorability task in MediaEval'20 has some challenging aspects compared to previous years. In this paper we identify the highly dynamic content of the videos and the limited size of the dataset as the core challenges for the task, propose directions to overcome some of these challenges, and present our initial results in these directions.


Leveraging Audio Gestalt to Predict Media Memorability

arXiv.org Artificial Intelligence

Memorability determines what evanesces into emptiness, and what worms its way into the deepest furrows of our minds. It is the key to curating more meaningful media content as we wade through daily digital torrents. The Predicting Media Memorability task in MediaEval 2020 aims to address the question of media memorability by setting the task of automatically predicting video memorability. Our approach is a multimodal deep learning-based late fusion that combines visual, semantic, and auditory features. We used audio gestalt to estimate the influence of the audio modality on overall video memorability, and accordingly inform which combination of features would best predict a given video's memorability scores.
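The exact fusion scheme is not given in the abstract; the sketch below illustrates the general late-fusion idea, with an assumed weighting in which an audio-gestalt score in [0, 1] scales the contribution of the audio stream and the remaining weight is redistributed to the visual and semantic streams. The weights and the gestalt heuristic are illustrative assumptions, not the authors' exact method.

```python
# Hypothetical late fusion of per-modality memorability predictions.
def late_fuse(p_visual, p_semantic, p_audio, audio_gestalt, base=(0.4, 0.4, 0.2)):
    """Weighted average of modality predictions; audio_gestalt in [0, 1]."""
    w_v, w_s, w_a = base
    w_a = w_a * audio_gestalt                      # shrink audio when it carries little signal
    rest = 1.0 - w_a
    w_v, w_s = rest * w_v / (w_v + w_s), rest * w_s / (w_v + w_s)
    return w_v * p_visual + w_s * p_semantic + w_a * p_audio

# Example: a clip whose audio track is judged to carry little information.
score = late_fuse(0.71, 0.68, 0.55, audio_gestalt=0.2)
```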


Contrastive Representation Learning: A Framework and Review

arXiv.org Machine Learning

Contrastive Learning has recently received interest due to its success in self-supervised representation learning in the computer vision domain. However, the origins of Contrastive Learning date as far back as the 1990s and its development has spanned many fields and domains, including Metric Learning and natural language processing. In this paper we provide a comprehensive literature review and we propose a general Contrastive Representation Learning framework that simplifies and unifies many different contrastive learning methods. We also provide a taxonomy for each of the components of contrastive learning in order to summarise it and distinguish it from other forms of machine learning. We then discuss the inductive biases which are present in any contrastive learning system and we analyse our framework under different views from various sub-fields of Machine Learning. Examples of how contrastive learning has been applied in computer vision, natural language processing, audio processing, and others, as well as in Reinforcement Learning, are also presented. Finally, we discuss the challenges and some of the most promising future research directions ahead.
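As a concrete instance of the kind of objective such a framework covers, here is a minimal InfoNCE-style contrastive loss over paired embeddings; the encoders, views, and transforms that the review organises into a framework are abstracted away, so this is a generic sketch rather than the paper's formulation.

```python
# Minimal InfoNCE / NT-Xent-style contrastive loss over a batch of paired embeddings.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """z_a, z_b: (N, D) embeddings of two views of the same N examples.

    Each row of z_a treats the matching row of z_b as its positive and all
    other rows as negatives.
    """
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature                      # (N, N) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)    # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Usage: embeddings from any encoder / projection head pair.
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```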