instructional video
Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state, such as the steps of a recipe or the steps of a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a particular sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, then leveraging this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional video, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
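To make the mechanism concrete, here is a minimal Python sketch of mining a first-order task graph (keystep transition probabilities) from how-to videos and using it to regularize noisy per-frame keystep scores. It is an illustration under assumptions, not the authors' implementation: the keystep names, the blending weight `lam`, and the first-order transition model are all hypothetical.

```python
from collections import defaultdict

def mine_task_graph(keystep_sequences):
    """Estimate P(next keystep | current keystep) from observed video sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in keystep_sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for cur, nxts in counts.items()
    }

def regularize(frame_scores, prev_keystep, graph, lam=0.5):
    """Blend per-frame classifier scores with the graph's transition prior."""
    prior = graph.get(prev_keystep, {})
    return {k: (1 - lam) * s + lam * prior.get(k, 0.0) for k, s in frame_scores.items()}

# Toy example: keystep sequences mined from two how-to videos (hypothetical labels).
seqs = [["crack egg", "whisk", "fry"], ["crack egg", "whisk", "season", "fry"]]
graph = mine_task_graph(seqs)
noisy = {"whisk": 0.40, "fry": 0.45, "season": 0.15}  # ambiguous frame-level scores
print(regularize(noisy, "crack egg", graph))  # the graph pushes "whisk" to the top
```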
Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized, described interactions from the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing-hand accuracy.
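The divided strategy can be sketched as alternating intra- and inter-modal attention, trained with a symmetric contrastive (InfoNCE) loss over pooled clip and narration embeddings. This is a minimal PyTorch sketch under assumed dimensions and mean pooling, not the paper's actual multilayer network; the temperature and head counts are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DividedCrossModalBlock(nn.Module):
    """One block that alternates intra-modal and inter-modal attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.intra_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        vis = vis + self.intra_v(vis, vis, vis)[0]  # visual regions attend to each other
        txt = txt + self.intra_t(txt, txt, txt)[0]  # narration tokens attend to each other
        vis = vis + self.inter_v(vis, txt, txt)[0]  # vision queries text
        txt = txt + self.inter_t(txt, vis, vis)[0]  # text queries vision
        return vis, txt

def contrastive_loss(vis, txt, temp=0.07):
    """Symmetric InfoNCE over mean-pooled clip/narration embeddings in a batch."""
    v = F.normalize(vis.mean(dim=1), dim=-1)  # (B, D)
    t = F.normalize(txt.mean(dim=1), dim=-1)
    logits = v @ t.T / temp                   # matched pairs lie on the diagonal
    labels = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

block = DividedCrossModalBlock()
vis = torch.randn(8, 49, 256)  # 8 clips x 49 spatial regions
txt = torch.randn(8, 12, 256)  # 8 narrations x 12 tokens
loss = contrastive_loss(*block(vis, txt))
```

Stacking several such blocks gives the multilayer attention the abstract describes.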
Exploring Ordinal Bias in Action Recognition for Instructional Videos
Kim, Joochan, Jung, Minjoon, Zhang, Byoung-Tak
Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos.
(Figure caption: due to the dominant action pair 'Take-Background', the model fails to predict the action 'Open'.)
Action recognition in instructional videos has witnessed remarkable progress, primarily driven by models that excel in curated benchmark datasets (Farha & Gall, 2019; Ishikawa et al., 2021; Li et al., 2020; Yi et al., 2021).
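Both manipulations are simple to state in code. The sketch below assumes a hypothetical segment representation (a label plus its frame indices); the dominant-pair mining, the masking convention, and the toy labels are illustrative, not the authors' exact protocol.

```python
import random
from collections import Counter

def find_dominant_pairs(label_sequences, top_k=1):
    """Mine the most frequently co-occurring consecutive action pairs."""
    pairs = Counter((a, b) for seq in label_sequences for a, b in zip(seq, seq[1:]))
    return {p for p, _ in pairs.most_common(top_k)}

def action_masking(segments, dominant_pairs):
    """Mask out the frames of segments that start a dominant action pair."""
    masked = list(segments)
    for i in range(len(segments) - 1):
        pair = (segments[i]["label"], segments[i + 1]["label"])
        if pair in dominant_pairs:
            masked[i] = {**segments[i], "frames": None}  # frames hidden from the model
    return masked

def sequence_shuffling(segments, seed=0):
    """Randomize the order of action segments within a video."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Toy video whose labels echo the failure case above (hypothetical format).
video = [
    {"label": "Take", "frames": [0, 1, 2]},
    {"label": "Background", "frames": [3, 4]},
    {"label": "Open", "frames": [5, 6, 7]},
]
pairs = find_dominant_pairs([[s["label"] for s in video]])
print(action_masking(video, pairs))   # the 'Take' segment's frames are masked
print(sequence_shuffling(video))      # segment order randomized
```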
DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning
Pham, Eddison, Priyadarshini, Prisha, Maliackel, Adrian, Bandi, Kanishk, Meo, Cristian, Zhu, Kevin
Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and undermine the video's educational intent. To address this gap, we introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset's scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and fused using a dynamic stride window selection algorithm that adaptively balances temporal context and redundancy. The final scene-level caption integrates visual semantics and temporal reasoning in a single instructional caption. Empirical evaluations against strong baselines, including VLLaMA3 and GPT-4o, demonstrate consistent gains on both N-gram-based metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore). Qualitative analyses further show that DynaStride produces captions that are more temporally coherent and informative, suggesting a promising direction for improving AI-powered instructional content generation.
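One plausible reading of the stride selection step, sketched below, is to pick per scene the smallest sampling stride whose retained frames are not overly redundant, so that static scenes get wide temporal context while dynamic ones keep detail. The redundancy measure (mean cosine similarity of consecutive frame features), the candidate strides, and the threshold are assumptions, not DynaStride's published algorithm.

```python
import numpy as np

def redundancy(features):
    """Mean cosine similarity between consecutive frame features."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return float((f[:-1] * f[1:]).sum(axis=1).mean())

def dynamic_stride_window(features, strides=(1, 2, 4, 8), max_redundancy=0.9):
    """Pick the smallest stride whose sampled frames fall below the redundancy cap.

    Small strides keep temporal detail; large strides widen context and cut
    near-duplicate frames. If every stride is still too redundant (a nearly
    static scene), fall back to the largest stride.
    """
    for s in strides:
        sampled = features[::s]
        if len(sampled) < 2 or redundancy(sampled) <= max_redundancy:
            return s, sampled
    return strides[-1], features[::strides[-1]]

# Toy scene: 32 frames of slowly drifting 128-d features (highly redundant).
rng = np.random.default_rng(0)
frames = 1.0 + np.cumsum(rng.normal(scale=0.05, size=(32, 128)), axis=0)
stride, sampled = dynamic_stride_window(frames)
print(f"chosen stride={stride}, kept {len(sampled)}/{len(frames)} frames")
```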
Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval
Wang, Yu, Tan, Tianhao, Wang, Yifei
Retrieving relevant instructional videos from multilingual medical archives is crucial for answering complex, multi-hop questions across language boundaries. However, existing systems either compress hour-long videos into coarse embeddings or incur prohibitive costs for fine-grained matching. We tackle the Multilingual Video Corpus Retrieval (mVCR) task in the NLPCC-2025 M4IVQA challenge with a multi-stage framework that integrates multilingual semantics, domain terminology, and efficient long-form processing. Video subtitles are divided into semantically coherent chunks, enriched with concise knowledge-graph (KG) facts, and organized into a hierarchical tree whose node embeddings are generated by a language-agnostic multilingual encoder. At query time, the same encoder embeds the input question; a coarse-to-fine tree search prunes irrelevant branches, and only the top-ranked chunks are re-scored by a lightweight large language model (LLM). This design avoids exhaustive cross-encoder scoring while preserving chunk-level precision. Experiments on the mVCR test set demonstrate state-of-the-art performance, and ablation studies confirm the complementary contributions of KG enrichment, hierarchical indexing, and targeted LLM re-ranking. The proposed method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections.
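A minimal sketch of the coarse-to-fine search, assuming precomputed chunk embeddings from a language-agnostic encoder: chunks are grouped under nodes with centroid embeddings, the query prunes to the best nodes, and only the surviving chunks are ranked (in the paper, those few survivors would then be re-scored by the lightweight LLM). The random grouping, branch factor, and cosine scoring are illustrative assumptions, not the authors' indexing scheme.

```python
import numpy as np

def build_tree(chunk_embs, branch=4, seed=0):
    """Group subtitle chunks under nodes; each node keeps a centroid embedding."""
    order = np.random.default_rng(seed).permutation(len(chunk_embs))
    return [
        {"centroid": chunk_embs[order[i:i + branch]].mean(axis=0),
         "members": order[i:i + branch]}
        for i in range(0, len(order), branch)
    ]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def coarse_to_fine(query_emb, nodes, chunk_embs, top_nodes=2, top_chunks=3):
    """Prune to the best nodes by centroid score, then rank only their chunks."""
    best = sorted(nodes, key=lambda n: cosine(query_emb, n["centroid"]), reverse=True)
    candidates = [m for n in best[:top_nodes] for m in n["members"]]
    ranked = sorted(candidates, key=lambda m: cosine(query_emb, chunk_embs[m]), reverse=True)
    return [int(m) for m in ranked[:top_chunks]]  # these go to the LLM re-ranker

# Toy corpus: 16 subtitle chunks embedded in 64 dimensions.
rng = np.random.default_rng(1)
chunks = rng.normal(size=(16, 64))
tree = build_tree(chunks)
query = chunks[5] + rng.normal(scale=0.1, size=64)  # a query near chunk 5
print(coarse_to_fine(query, tree, chunks))          # chunk 5 should rank first
```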