AITopics

Technology: Information Technology > Artificial Intelligence > Cognitive Science (0.32)

Neural Information Processing SystemsMar-21-2026, 21:51:59 GMT

ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos. Despite apparent relevance and potential complementarity, these two problems have been investigated as separate and distinct tasks. In this work, we tackle these two problems, action segmentation, and action anticipation, jointly using a unified diffusion model dubbed ActFusion. The key idea to unification is to train the model to effectively handle both visible and invisible parts of the sequence in an integrated manner;the visible part is for temporal segmentation, and the invisible part is for future anticipation. To this end, we introduce a new anticipative masking strategy during training in which a late part of the video frames is masked as invisible, and learnable tokens replace these frames to learn to predict the invisible future.Experimental results demonstrate the bi-directional benefits between action segmentation and anticipation.ActFusion achieves the state-of-the-art performance across the standard benchmarks of 50 Salads, Breakfast, and GTEA, outperforming task-specific models in both of the two tasks with a single unified model through joint learning.

artificial intelligence, machine learning, proceedings, (7 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.60)
Information Technology > Artificial Intelligence > Machine Learning (0.43)

Neural Information Processing SystemsFeb-17-2026, 03:58:12 GMT

ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos.

artificial intelligence, machine learning, natural language, (17 more...)

Country:

Europe > Germany > North Rhine-Westphalia > Cologne Region > Bonn (0.04)
Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
Africa > Central African Republic > Ombella-M'Poko > Bimbo (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Education (0.67)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Artificial IntelligenceNov-18-2025

Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation

Chu, Qiaohui, Zhang, Haoyu, Liu, Meng, Feng, Yisen, Shi, Haoxiang, Nie, Liqiang

Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) un-derutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) intention inference (reason) action anticipation (answer). Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability. Introduction In real-world applications such as human-computer interaction (Azam and Desai 2024; Plizzari et al. 2024), augmented reality (Abreu et al. 2024; Xu et al. 2024), and assistive systems for visually impaired individuals (Lee et al. 2024; Xiao et al. 2025), AI agents must accurately interpret user intent and demonstrate effective long-term planning capabilities within egocentric vision scenarios.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

2508.01742

Country: Asia > China (0.46)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Kim, Seulgi, Kokilepersaud, Kiran, Prabhushankar, Mohit, AlRegib, Ghassan

Countering Multi-modal Representation Collapse through Rank-targeted Fusion

arXiv.org Artificial IntelligenceNov-11-2025

Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse where individual dimensions lose their discriminative power (as measured by eigenspectra), and modality collapse where one dominant modality overwhelms the other. Applications like human action anticipation that require fusing multifarious sensor data are hindered by both feature and modality collapse. However, existing methods attempt to counter feature collapse and modality collapse separately. This is because there is no unifying framework that efficiently addresses feature and modality collapse in conjunction. In this paper, we posit the utility of effective rank as an informative measure that can be utilized to quantify and counter both the representation collapses. We propose \textit{Rank-enhancing Token Fuser}, a theoretically grounded fusion framework that selectively blends less informative features from one modality with complementary features from another modality. We show that our method increases the effective rank of the fused representation. To address modality collapse, we evaluate modality combinations that mutually increase each others' effective rank. We show that depth maintains representational balance when fused with RGB, avoiding modality collapse. We validate our method on action anticipation, where we present \texttt{R3D}, a depth-informed fusion framework. Extensive experiments on NTURGBD, UTKinect, and DARai demonstrate that our approach significantly outperforms prior state-of-the-art methods by up to 3.74\%. Our code is available at: \href{https://github.com/olivesgatech/R3D}{https://github.com/olivesgatech/R3D}.

artificial intelligence, machine learning, modality, (15 more...)

2511.0645

Country: North America > United States (1.00)

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)

Neural Information Processing SystemsOct-10-2025, 12:01:22 GMT

ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

Temporal action segmentation and long-term action anticipation are two popular vision tasks for the temporal analysis of actions in videos.

action segmentation, anticipation, computer vision, (13 more...)

Country:

Europe > Germany > North Rhine-Westphalia > Cologne Region > Bonn (0.04)
Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
Africa > Central African Republic > Ombella-M'Poko > Bimbo (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Education (0.67)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Artificial IntelligenceJun-12-2025

Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025

Chu, Qiaohui, Zhang, Haoyu, Feng, Yisen, Liu, Meng, Guan, Weili, Wang, Yaowei, Nie, Liqiang

In this report, we present a novel three-stage framework developed for the Ego4D Long-T erm Action Anticipation (LTA) task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. First, visual features are extracted using a high-performance visual encoder . The features are then fed into a Transformer to predict verbs and nouns, with a verb-noun co-occurrence matrix incorporated to enhance recognition accuracy. Finally, the predicted verb-noun pairs are formatted as textual prompts and input into a fine-tuned large language model (LLM) to anticipate future action sequences. Our framework achieves first place in this challenge at CVPR 2025, establishing a new state-of-the-art in long-term action prediction.

arxiv preprint arxiv, large language model, natural language, (13 more...)

2506.0255

Country: Asia > China (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Kim, Seulgi, Kaviani, Ghazal, Prabhushankar, Mohit, AlRegib, Ghassan

Multi-level and Multi-modal Action Anticipation

arXiv.org Artificial IntelligenceJun-4-2025

Action anticipation, the task of predicting future actions from partially observed videos, is crucial for advancing intelligent systems. Unlike action recognition, which operates on fully observed videos, action anticipation must handle incomplete information. Hence, it requires temporal reasoning, and inherent uncertainty handling. While recent advances have been made, traditional methods often focus solely on visual modalities, neglecting the potential of integrating multiple sources of information. Drawing inspiration from human behavior, we introduce \textit{Multi-level and Multi-modal Action Anticipation (m\&m-Ant)}, a novel multi-modal action anticipation approach that combines both visual and textual cues, while explicitly modeling hierarchical semantic information for more accurate predictions. To address the challenge of inaccurate coarse action labels, we propose a fine-grained label generator paired with a specialized temporal consistency loss function to optimize performance. Extensive experiments on widely used datasets, including Breakfast, 50 Salads, and DARai, demonstrate the effectiveness of our approach, achieving state-of-the-art results with an average anticipation accuracy improvement of 3.08\% over existing methods. This work underscores the potential of multi-modal and hierarchical modeling in advancing action anticipation and establishes a new benchmark for future research in the field. Our code is available at: https://github.com/olivesgatech/mM-ant.

artificial intelligence, machine learning, natural language, (17 more...)

2506.02382

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing SystemsMay-27-2025, 11:18:48 GMT

ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

actfusion, action segmentation and anticipation, unified diffusion model, (2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.79)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.63)

arXiv.org Artificial IntelligenceMay-14-2025

Hierarchical and Multimodal Data for Daily Activity Understanding

Kaviani, Ghazal, Yarici, Yavuz, Kim, Seulgi, Prabhushankar, Mohit, AlRegib, Ghassan, Solh, Mashhour, Patil, Ameya

Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and gaze tracker. To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3 procedures are shared between L2 actions. The overlap and unscripted nature of DARai allows counterfactual activities in the dataset. Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications. Specifically, we conduct unimodal and multimodal sensor fusion experiments for recognition, temporal localization, and future action anticipation across all hierarchical annotation levels. To highlight the limitations of individual sensors, we also conduct domain-variant experiments that are enabled by DARai's multi-sensor and counterfactual activity design setup. The code, documentation, and dataset are available at the dedicated DARai website: https://alregib.ece.gatech.edu/software-and-datasets/darai-daily-activity-recordings-for-artificial-intelligence-and-machine-learning/

artificial intelligence, deep learning, machine learning, (20 more...)

2504.17696

Country: North America > United States (0.45)

Genre:

Research Report > New Finding (0.92)
Workflow (0.87)

Industry:

Information Technology (1.00)
Education (0.93)
Health & Medicine > Consumer Health (0.92)
(2 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)