AITopics | atomic action

Storyboard-guided Alignment for Fine-grained Video Action Recognition

Neural Information Processing Systems Jun-17-2026, 07:19:32 GMT

Fine-grained video action recognition can be formulated as a video-text matching problem. Previous approaches primarily rely on global video semantics to consolidate video embeddings, often leading to misaligned video-text pairs due to inaccurate atomic-level action understanding. This inaccuracy arises because i) videos with distinct global semantics may share similar atomic actions or visual appearances, and ii) atomic actions can be momentary, gradual, or not directly aligned with overarching video semantics. Inspired by storyboarding, where a script is segmented into individual shots, we propose a multi-granularity framework, SFAR. SFAR generates fine-grained descriptions of common atomic actions for each global semantic using a large language model. Unlike existing works that refine global semantics with auxiliary video frames, SFAR introduces a filtering metric to ensure correspondence between the descriptions and the global semantics, eliminating the need for direct video involvement and thereby enabling more nuanced recognition of subtle actions. By leveraging both global semantics and fine-grained descriptions, SFAR effectively identifies prominent frames within videos, thereby improving the accuracy of embedding aggregation. Extensive experiments on various video action recognition datasets demonstrate the competitive performance of SFAR in supervised, few-shot, and zero-shot settings.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country: Asia > China (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Leisure & Entertainment > Sports (0.46)
Health & Medicine > Consumer Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

Neural Information Processing Systems Jun-16-2026, 02:42:23 GMT

Scientific embodied agents play a crucial role in modern laboratories by automating complex experimental workflows. Compared to typical household environments, laboratory settings impose significantly higher demands on perception of physical-chemical transformations and long-horizon planning, making them an ideal testbed for advancing embodied intelligence. However, its development has long been hampered by the lack of suitable simulators and benchmarks. In this paper, we address this gap by introducing LabUtopia, a comprehensive simulation and benchmarking suite designed to facilitate the development of generalizable, reasoning-capable embodied agents in laboratory settings.

artificial intelligence, container, machine learning, (18 more...)

Neural Information Processing Systems

Country: Asia > China (0.28)

Genre:

Workflow (1.00)
Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Materials > Chemicals (0.93)
Information Technology (0.68)
Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Storyboard-guided Alignment for Fine-grained Video Action Recognition

Neural Information Processing Systems Jun-12-2026, 04:50:29 GMT

artificial intelligence, large language model, natural language, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.60)

0ff30c4bf31db0119a6219e0d250e037-Paper-Conference.pdf

Neural Information Processing Systems Apr-24-2026, 22:25:41 GMT

large language model, machine learning, programming language, (20 more...)

Neural Information Processing Systems

Country:

Asia > China (0.28)
North America > United States (0.28)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Factored Bandits

Julian Zimmert, Yevgeny Seldin

Neural Information Processing Systems Feb-12-2026, 10:26:08 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, assumption, bandit, (15 more...)

Neural Information Processing Systems

Country:

Europe > Denmark > Capital Region > Copenhagen (0.04)
North America > Canada > Quebec > Montreal (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.72)

95688ba636a4720a85b3634acfec8cdd-Supplemental.pdf

Neural Information Processing Systems Feb-10-2026, 01:33:14 GMT

annotation, atomic action, video, (15 more...)

Neural Information Processing Systems

Country: North America > United States > California > Santa Clara County > Palo Alto (0.04)

Industry:

Consumer Products & Services (0.69)
Leisure & Entertainment > Sports (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.41)

95688ba636a4720a85b3634acfec8cdd-Paper.pdf

Neural Information Processing Systems Feb-10-2026, 01:33:10 GMT

computer vision, hypergraph, proceedings, (12 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > China > Guangxi Province > Nanning (0.04)
North America > Canada > Newfoundland and Labrador > Labrador (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
(3 more...)

SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models

Neural Information Processing Systems Feb-7-2026, 22:34:06 GMT

Our SheetCopilot correctly completes 44.3% of tasks for

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Washington > King County > Seattle (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

MOMA: Multi-Object Multi-Actor Activity Parsing

Neural Information Processing Systems Dec-24-2025, 12:46:17 GMT

Complex activities often involve multiple humans utilizing different objects to complete actions (e.g., in healthcare settings, physicians, nurses, and patients interact with each other and various medical devices). Recognizing activities poses a challenge that requires a detailed understanding of actors' roles, objects' affordances, and their associated relationships. Furthermore, these purposeful activities are composed of multiple achievable steps, including sub-activities and atomic actions, which jointly define a hierarchy of action parts. This paper introduces Activity Parsing as the overarching task of temporal segmentation and classification of activities, sub-activities, and atomic actions, along with an instance-level understanding of actors, objects, and their relationships in videos. Because activities involve multiple entities (actors and objects), we argue that traditional pair-wise relationships, often used in scene or action graphs, do not appropriately represent the dynamics between them. Hence, we introduce Action Hypergraph, a spatial-temporal graph containing hyperedges (i.e., edges with higher-order relationships), as a new representation. In addition, we introduce Multi-Object Multi-Actor (MOMA), the first benchmark and dataset dedicated to activity parsing. Lastly, to parse a video, we propose the HyperGraph Activity Parsing (HGAP) network, which outperforms several baselines, including those based on regular graphs and raw video data.

activity parsing, multi-object multi-actor activity parsing, name change, (6 more...)

Neural Information Processing Systems

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence (1.00)

Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation

Yuxiang Zhou, Jichang Li, Yanhao Zhang, Haonan Lu, Guanbin Li

arXiv.org Artificial Intelligence Dec-4-2025

Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents' excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UI). The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. Planning demands high-level, strategy-oriented experiences, whereas operations necessitate low-level, precise instructions closely tied to specific app UIs. Motivated by these insights, we propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation. At the planning stage, we introduce Manager-RAG to reduce strategic hallucinations by retrieving human-validated comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG to improve execution accuracy by retrieving the most precise low-level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.

artificial intelligence, mobile-agent-rag, opened note, (14 more...)

arXiv.org Artificial Intelligence

2511.12254

Country: North America > United States (0.94)

Genre: Research Report > New Finding (1.00)

Industry: