Fukuda, Ken
Comparison of Metadata Representation Models for Knowledge Graph Embeddings
Egami, Shusaku, Matsushita, Kyoumoto, Ugai, Takanori, Fukuda, Ken
Hyper-relational Knowledge Graphs (HRKGs) extend traditional KGs beyond binary relations, enabling the representation of contextual, provenance, and temporal information in domains such as historical events, sensor data, video content, and narratives. HRKGs can be structured using several Metadata Representation Models (MRMs), including Reification (REF), Singleton Property (SGP), and RDF-star (RDR). However, the effects of different MRMs on KG Embedding (KGE) and Link Prediction (LP) models remain unclear. This study evaluates MRMs in the context of LP tasks, identifies the limitations of existing evaluation frameworks, and introduces a new task that ensures fair comparisons across MRMs. Furthermore, we propose a framework that effectively reflects the knowledge representations of the three MRMs in latent space. Experiments on two types of datasets reveal that REF performs well in simple HRKGs, whereas SGP is less effective. However, in complex HRKGs, the differences among MRMs in the LP tasks are minimal. Our findings contribute to an optimal knowledge representation strategy for HRKGs in LP tasks.
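As a concrete illustration of the three MRMs compared above, the sketch below encodes the same hyper-relational fact in Reification, Singleton Property, and RDF-star. The fact, the ex: namespace, and the entity names are hypothetical and are not taken from the paper's datasets; the Turtle encodings are embedded in a Python script, and only the two plain-RDF variants are parsed with rdflib.

```python
# Hypothetical example: the fact "Einstein was employed by ETH Zurich starting
# in 1912" encoded with each Metadata Representation Model. Entity and
# predicate names are illustrative only.
from rdflib import Graph

REIFICATION = """
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .

ex:stmt1 a rdf:Statement ;
    rdf:subject   ex:Einstein ;
    rdf:predicate ex:employer ;
    rdf:object    ex:ETH_Zurich ;
    ex:startYear  "1912" .          # metadata hangs off the statement node
"""

SINGLETON_PROPERTY = """
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .

ex:Einstein ex:employer_1 ex:ETH_Zurich .            # one-off property instance
ex:employer_1 rdf:singletonPropertyOf ex:employer ;
    ex:startYear "1912" .                            # metadata on that property
"""

RDF_STAR = """
@prefix ex: <http://example.org/> .

<< ex:Einstein ex:employer ex:ETH_Zurich >> ex:startYear "1912" .  # quoted triple
"""

# REF and SGP are plain RDF and parse with any Turtle parser; RDF-star needs a
# parser that understands quoted triples, so it is shown here as text only.
for name, ttl in [("REF", REIFICATION), ("SGP", SINGLETON_PROPERTY)]:
    g = Graph()
    g.parse(data=ttl, format="turtle")
    print(name, "->", len(g), "triples")
```

The structural differences are what the paper probes: REF introduces an extra statement node, SGP mints a fresh predicate per statement, and RDR keeps the base triple intact while attaching metadata to the quoted triple.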
A Video-grounded Dialogue Dataset and Metric for Event-driven Activities
Imrattanatrai, Wiradee, Asada, Masaki, Hasegawa, Kimihiro, Cheng, Zhi-Qi, Fukuda, Ken, Mitamura, Teruko
This paper presents VDAct, a dataset for Video-grounded Dialogue on Event-driven Activities, alongside VDEval, a session-based context evaluation metric specially designed for the task. Unlike existing datasets, VDAct includes longer and more complex video sequences that depict a variety of event-driven activities requiring advanced contextual understanding for accurate response generation. The dataset comprises 3,000 dialogues with over 30,000 question-and-answer pairs, derived from 1,000 videos with diverse activity scenarios. VDAct is notably challenging due to its broad spectrum of activity scenarios and wide range of question types. Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval, which integrates dialogue session history and video content summaries extracted from our supplementary Knowledge Graphs to evaluate individual responses, demonstrates a significantly higher correlation with human assessments on the VDAct dataset than existing evaluation metrics that rely solely on the context of single dialogue turns.
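To make the session-based idea concrete, here is a minimal, hypothetical sketch of how a VDEval-style score could be computed: each response is judged against a context built from a KG-derived video summary plus the dialogue history so far. This is not the authors' implementation; the judge function is a stand-in (a trivial token-overlap heuristic) for the LLM-based scorer, and all names and data shapes are assumptions.

```python
# Sketch of session-based response evaluation (illustrative only).
from dataclasses import dataclass

@dataclass
class Turn:
    question: str
    reference: str  # gold answer
    response: str   # system answer under evaluation

def judge(context: str, reference: str, response: str) -> float:
    # Placeholder scorer: fraction of reference tokens found in the response.
    # A real judge (e.g., an LLM) would also condition on `context`.
    ref, res = set(reference.lower().split()), set(response.lower().split())
    return len(ref & res) / max(len(ref), 1)

def evaluate_session(turns: list[Turn], kg_video_summary: str) -> list[float]:
    scores, history = [], []
    for t in turns:
        # Session-based context: KG-derived video summary + all preceding turns.
        context = kg_video_summary + "\n" + "\n".join(history)
        scores.append(judge(context, t.reference, t.response))
        history.append(f"Q: {t.question}\nA: {t.reference}")
    return scores

print(evaluate_session(
    [Turn("What is the person slicing?", "A tomato.", "They are slicing a tomato.")],
    "Summary from KG: a person prepares a salad in the kitchen.",
))
```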
ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding
Hasegawa, Kimihiro, Imrattanatrai, Wiradee, Cheng, Zhi-Qi, Asada, Masaki, Holm, Susan, Wang, Yuran, Fukuda, Ken, Mitamura, Teruko
Multimodal systems have great potential to assist humans in procedural activities, where people follow instructions to achieve their goals. Despite diverse application scenarios, systems are typically evaluated on traditional classification tasks, e.g., action recognition or temporal action segmentation. In this paper, we present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios. ProMQA consists of 401 multimodal procedural QA pairs on user recordings of procedural activities coupled with their corresponding instructions. For QA annotation, we take a cost-effective human-LLM collaborative approach, where the existing annotation is augmented with LLM-generated QA pairs that are later verified by humans. We then provide the benchmark results to set the baseline performance on ProMQA. Our experiment reveals a significant gap between human performance and that of current systems, including competitive proprietary multimodal models. We hope our dataset sheds light on new aspects of models' multimodal understanding capabilities.
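The human-LLM collaborative annotation step can be pictured as the small pipeline below. This is an illustrative sketch only, not the authors' tooling: the LLM drafting call is stubbed out, and the function names and data shapes are assumptions.

```python
# Sketch of a human-LLM collaborative QA annotation loop (hypothetical).
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str

def draft_qa_pairs(activity_annotation: str, instruction: str) -> list[QAPair]:
    """Stand-in for an LLM prompt that drafts QA candidates from an existing
    recording annotation and its paired instruction text."""
    return [QAPair("Which ingredient was added out of order?", "The sugar.")]

def human_verify(candidates: list[QAPair], approved_questions: set[str]) -> list[QAPair]:
    """Keep only the candidates an annotator approved."""
    return [qa for qa in candidates if qa.question in approved_questions]

candidates = draft_qa_pairs("annotated cooking recording", "recipe instruction")
dataset = human_verify(candidates, {"Which ingredient was added out of order?"})
print(dataset)
```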
Synthetic Multimodal Dataset for Empowering Safety and Well-being in Home Environments
Ugai, Takanori, Egami, Shusaku, Htun, Swe Nwe Nwe, Kozaki, Kouji, Kawamura, Takahiro, Fukuda, Ken
This paper presents a synthetic multimodal dataset of daily activities that fuses video data from a 3D virtual space simulator with knowledge graphs depicting the spatiotemporal context of the activities. The dataset is developed for the Knowledge Graph Reasoning Challenge for Social Issues (KGRC4SI), which focuses on identifying and addressing hazardous situations in the home environment. The dataset is available to the public as a valuable resource for researchers and practitioners developing innovative solutions that recognize human behaviors to enhance safety and well-being in home environments.
RDF-star2Vec: RDF-star Graph Embeddings for Data Mining
Egami, Shusaku, Ugai, Takanori, Oota, Masateru, Matsushita, Kyoumoto, Kawamura, Takahiro, Kozaki, Kouji, Fukuda, Ken
Knowledge Graphs (KGs) such as Resource Description Framework (RDF) data represent relationships between various entities through the structure of triples (subject, predicate, object).
Synthesizing Event-centric Knowledge Graphs of Daily Activities Using Virtual Space
Egami, Shusaku, Ugai, Takanori, Oono, Mikiko, Kitamura, Koji, Fukuda, Ken
Artificial intelligence (AI) is expected to be embodied in software agents, robots, and cyber-physical systems that can understand the diverse contextual information of daily life in the home environment to support human behavior and decision making in various situations. Scene graph and knowledge graph (KG) construction technologies have attracted much attention for knowledge-based embodied question answering that meets this expectation. However, collecting and managing real data on daily activities under various experimental conditions in a physical space is quite costly, and developing AI that understands the intentions and contexts is difficult. In the future, data from both virtual spaces, where conditions can be easily modified, and physical spaces, where conditions are difficult to change, are expected to be combined to analyze daily living activities. However, studies on KG construction of daily activities using virtual space, and their applications, have yet to progress. The potential and challenges must still be clarified to facilitate AI development for human daily life. Thus, this study proposes the VirtualHome2KG framework to generate synthetic KGs of daily life activities in virtual space. This framework augments both the synthetic video data of daily activities and the contextual semantic data corresponding to the video contents based on the proposed event-centric schema and virtual space simulation results. Therefore, context-aware data can be analyzed, and various applications that have conventionally been difficult to develop due to the insufficient availability of relevant data and semantic information can be developed. We also demonstrate herein the utility and potential of the proposed VirtualHome2KG framework through several use cases, including the analysis of daily activities by querying, embedding, and clustering, and fall risk detection among ...
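As a toy illustration of the kind of event-centric querying such a framework enables, the sketch below builds a tiny synthetic activity graph and retrieves kitchen events with SPARQL via rdflib. The mini-schema (ex:Event, ex:action, ex:place, ex:time) and the sample events are hypothetical and are not the VirtualHome2KG ontology or data.

```python
# Toy event-centric KG of daily activities plus a SPARQL query (illustrative only).
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")
g = Graph()

# Two synthetic daily-activity events, as might be generated from a virtual-space episode.
events = [
    ("WashDishes", "Kitchen", "2024-01-01T08:00:00"),
    ("WatchTV", "LivingRoom", "2024-01-01T08:30:00"),
]
for i, (action, place, t) in enumerate(events):
    event = EX[f"event{i}"]
    g.add((event, RDF.type, EX.Event))
    g.add((event, EX.action, EX[action]))
    g.add((event, EX.place, EX[place]))
    g.add((event, EX.time, Literal(t, datatype=XSD.dateTime)))

# Query: which activities happened in the kitchen, and when?
q = """
PREFIX ex: <http://example.org/>
SELECT ?action ?time WHERE {
    ?e a ex:Event ;
       ex:action ?action ;
       ex:place ex:Kitchen ;
       ex:time ?time .
}
"""
for row in g.query(q):
    print(row.action, row.time)
```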