AITopics | language-refined graph

MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing Supplementary Material

Neural Information Processing SystemsApr-25-2026, 02:06:17 GMT

VLMEvaluation To evaluate two VLMs (Frozen in Time [1] and VideoCLIP [13]), we use a hybrid approach that leverages both prototypical networks [11] and the video-language similarity metrics learned by both models. Below, we show an ablation study where we use only the video prototype networks. We show the performance of using only language similarity in the few-shot case to demonstrate the effects of sample removal, and we also show the effects of our hybrid weighting scheme, where we weight the language embeddings five times more than the video embeddings when constructing the hybrid prototype (as opposed to equal weighting during the regular hybrid approach). Although we perform our ablation study with Frozen-in-Time, and use the same weighting scheme and prototype strategy for VideoCLIP as well. For this study, we show activity and sub-activity classification accuracy in the 5-shot case. We visualize whether a given method uses language, video, or both to create its prototype embeddings.

artificial intelligence, machine learning, natural language, (14 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.91)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.40)

Add feedback

MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing

Neural Information Processing SystemsDec-23-2025, 21:53:03 GMT

Video-language models (VLMs), large models pre-trained on numerous but noisy video-text pairs from the internet, have revolutionized activity recognition through their remarkable generalization and open-vocabulary capabilities. While complex human activities are often hierarchical and compositional, most existing tasks for evaluating VLMs focus only on high-level video understanding, making it difficult to accurately assess and interpret the ability of VLMs to understand complex and fine-grained human activities. Inspired by the recently proposed MOMA framework, we define activity graphs as a single universal representation of human activities that encompasses video understanding at the activity, sub-activity, and atomic action level. We redefine activity parsing as the overarching task of activity graph generation, requiring understanding human activities across all three levels. To facilitate the evaluation of models on activity parsing, we introduce MOMA-LRG (Multi-Object Multi-Actor Language-Refined Graphs), a large dataset of complex human activities with activity graph annotations that can be readily transformed into natural language sentences. Lastly, we present a model-agnostic and lightweight approach to adapting and evaluating VLMs by incorporating structured knowledge from activity graphs into VLMs, addressing the individual limitations of language and graphical models. We demonstrate strong performance on few-shot activity parsing, and our framework is intended to foster future research in the joint modeling of videos, graphs, and language.

human activity, language-refined graph, multi-object multi-actor activity parsing, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing

Neural Information Processing SystemsOct-10-2024, 07:38:51 GMT

Video-language models (VLMs), large models pre-trained on numerous but noisy video-text pairs from the internet, have revolutionized activity recognition through their remarkable generalization and open-vocabulary capabilities. While complex human activities are often hierarchical and compositional, most existing tasks for evaluating VLMs focus only on high-level video understanding, making it difficult to accurately assess and interpret the ability of VLMs to understand complex and fine-grained human activities. Inspired by the recently proposed MOMA framework, we define activity graphs as a single universal representation of human activities that encompasses video understanding at the activity, sub-activity, and atomic action level. We redefine activity parsing as the overarching task of activity graph generation, requiring understanding human activities across all three levels. To facilitate the evaluation of models on activity parsing, we introduce MOMA-LRG (Multi-Object Multi-Actor Language-Refined Graphs), a large dataset of complex human activities with activity graph annotations that can be readily transformed into natural language sentences.

human activity, language-refined graph, multi-object multi-actor activity parsing, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.91)

Add feedback

Filters

Collaborating Authors

language-refined graph

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing Supplementary Material

MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing

MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing