OMCAT: Omni Context Aware Transformer
Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro
arXiv.org Artificial Intelligence
Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. However, these models continue to struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. We address these challenges with two key contributions: a new dataset and a new model, called OCTAV and OMCAT respectively. First, OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video. Second, OMCAT (Omni Context Aware Transformer) is a powerful model that leverages RoTE (Rotary Time Embeddings), an innovative extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks. Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment, as validated through comprehensive experiments and ablation studies. Our dataset and code will be made publicly available. The link to our demo page is https://om-cat.github.io.

Figure 1: Illustration of a video sequence from our proposed OCTAV dataset.

Large language models (LLMs) (Achiam et al., 2023; Touvron et al., 2023) have achieved remarkable breakthroughs in both text generation and comprehension tasks (McKeown, 1992; Achiam et al., 2023). Since then, significant progress has been made to extend LLMs to multimodal LLMs (Cheng et al., 2024; Li et al., 2023b; Maaz et al., 2023; Li et al., 2024), which integrate visual and audio inputs with textual instructions to provide understanding in multimodal contexts (Yang et al., 2022b; Chen et al., 2023a;b). However, these models still face challenges in handling fine-grained, cross-modal temporal understanding when both audio and video are provided. In this paper, we address these limitations by proposing a new dataset, OCTAV, and a model called OMCAT. The Omni Context and Temporal Audio Video dataset, OCTAV, consists of question-answer pairs for each video, designed to capture event transitions across audio and video. The Omni Context Aware Transformer, OMCAT, addresses the limitations of existing models (Maaz et al., 2023; Tang et al., 2024; Su et al., 2023; Cheng et al., 2024) through a unified audio and visual language model that effectively incorporates time representations to ground the modalities temporally.
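This excerpt names RoTE as an extension of RoPE for time-anchored grounding but does not give its formulation. The snippet below is only a minimal sketch under one plausible reading, assuming RoTE drives RoPE-style pairwise rotations with each audio/video token's absolute timestamp in seconds rather than its integer position index; the function name `rotary_time_embedding` and all variable names are illustrative, not taken from the paper.

```python
import torch


def rotary_time_embedding(x: torch.Tensor, timestamps: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply a RoPE-style rotation to features, driven by continuous timestamps.

    x          : (num_tokens, dim) feature vectors; dim must be even
    timestamps : (num_tokens,) time of each audio/video token in seconds
    """
    dim = x.shape[-1]
    assert dim % 2 == 0, "feature dimension must be even for pairwise rotation"

    # Frequency schedule follows standard RoPE over half the feature dimension.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

    # Key assumed difference from vanilla RoPE: angle = timestamp * frequency,
    # not position_index * frequency, so tokens are anchored to wall-clock time.
    angles = timestamps[:, None].float() * inv_freq[None, :]   # (num_tokens, dim/2)
    cos, sin = angles.cos(), angles.sin()

    # Rotate each consecutive (even, odd) feature pair by its angle.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


# Example: five video tokens sampled at irregular timestamps (seconds).
feats = torch.randn(5, 64)
times = torch.tensor([0.0, 1.5, 3.2, 6.0, 7.4])
rotated = rotary_time_embedding(feats, times)
```

Under this assumption, two tokens separated by a fixed number of seconds get the same relative rotation regardless of how many tokens lie between them, which is one way a rotary scheme could support the time-anchored tasks described above.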
Oct-15-2024