Koo, Jaywon
Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities
Ayyubi, Hammad A., Thomas, Christopher, Chum, Lovish, Lokesh, Rahul, Chen, Long, Niu, Yulei, Lin, Xudong, Feng, Xuande, Koo, Jaywon, Ray, Sounak, Chang, Shih-Fu
Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer whether events across the textual and visual (video) domains are identical (via grounding) and thus on the same semantic level. However, grounding fails to capture the intricate cross-event relations that arise because the same events are referred to at many semantic levels. For example, in Figure 1, the abstract event of "war" manifests at a lower semantic level through the subevents "tanks firing" (in video) and an airplane being "shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method, which demonstrates improved performance on this task, and highlight opportunities for future research.
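For illustration only, below is a minimal Python sketch of how such a cross-modal event hierarchy might be represented. The Event class, its fields, and the toy spans are hypothetical and do not reflect MultiHiEve's actual annotation schema; they only mirror the "war" example from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Illustrative sketch only: class and field names are hypothetical, not the
# MultiHiEve annotation format.

@dataclass
class Event:
    description: str                         # e.g. "war", "tanks firing"
    modality: str                            # "text" or "video"
    span: Optional[Tuple] = None             # token span (text) or (start_s, end_s) (video)
    subevents: List["Event"] = field(default_factory=list)  # finer-grained manifestations

    def add_subevent(self, child: "Event") -> None:
        """Attach a lower-semantic-level event that realizes this one."""
        self.subevents.append(child)


# Toy hierarchy: the abstract "war" event manifests through a video subevent
# and a text subevent, as in the example from the abstract.
war = Event(description="war", modality="text", span=(12, 13))
war.add_subevent(Event(description="tanks firing", modality="video", span=(4.0, 9.5)))
war.add_subevent(Event(description="airplane shot", modality="text", span=(40, 43)))
```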
Multi-Modality Multi-Loss Fusion Network
Wu, Zehui, Gong, Ziwei, Koo, Jaywon, Hirschberg, Julia
The multimodal affective computing field has seen significant advances in feature extraction and multimodal fusion methodologies in recent years. By combining audio, text, and visual signals, these models offer a more comprehensive, nuanced understanding of human emotions. However, there are still limitations: hand-crafted feature extraction algorithms often lack flexibility and generalization across diverse tasks. To overcome these limitations, recent studies have proposed fully end-to-end models that optimize both feature extraction and learning processes jointly (Dai et al., 2021). Our work extracts feature representations from pre-trained models for different modalities and combines them in an end-to-end manner, which provides a comprehensive and adaptable solution for multimodal affective computing.

We compare different methods for extracting audio features as well as different fusion network methods to combine audio and text signals, to identify the best-performing procedures. We find that the addition of audio signals consistently improves performance, and that our transformer fusion network further enhances results on most metrics and achieves state-of-the-art results across all datasets, indicating its efficacy in enhancing cross-modality modeling and its potential for multimodal emotion detection. From multi-loss training, we also observe that 1) using distinct labels for each modality significantly benefits the models' performance, and 2) training on multimodal features improves not only the overall model performance but also the model's accuracy on the single-modality subnets.
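For illustration only, a minimal PyTorch sketch of the multi-loss fusion idea described above: pre-extracted text and audio features pass through a small fusion transformer, and separate classification heads over each modality and over the fused representation each contribute a loss term. The module names, dimensions, single fusion layer, and the reuse of one label across heads are assumptions for brevity, not the paper's actual architecture or training setup (the paper reports gains from distinct per-modality labels).

```python
import torch
import torch.nn as nn

class MultiLossFusion(nn.Module):
    """Hypothetical sketch: fuse pre-extracted text/audio features with a
    transformer and supervise each modality subnet plus the fused output."""
    def __init__(self, text_dim=768, audio_dim=768, hidden=256, num_classes=3):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=1)
        # One head per modality subnet, plus one for the fused representation.
        self.text_head = nn.Linear(hidden, num_classes)
        self.audio_head = nn.Linear(hidden, num_classes)
        self.fused_head = nn.Linear(hidden, num_classes)

    def forward(self, text_feats, audio_feats):
        t = self.text_proj(text_feats)        # (B, Lt, H)
        a = self.audio_proj(audio_feats)      # (B, La, H)
        fused = self.fusion(torch.cat([t, a], dim=1)).mean(dim=1)  # pooled fused features
        return (self.text_head(t.mean(dim=1)),
                self.audio_head(a.mean(dim=1)),
                self.fused_head(fused))


# One multi-loss training step on random features (same label reused here;
# the paper supervises each modality with its own label).
model = MultiLossFusion()
criterion = nn.CrossEntropyLoss()
text_feats, audio_feats = torch.randn(8, 20, 768), torch.randn(8, 50, 768)
labels = torch.randint(0, 3, (8,))
text_logits, audio_logits, fused_logits = model(text_feats, audio_feats)
loss = (criterion(fused_logits, labels)
        + criterion(text_logits, labels)
        + criterion(audio_logits, labels))
loss.backward()
```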