- Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (22 more...)
- Education > Curriculum > Subject-Specific Education (0.96)
- Health & Medicine (0.69)
Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
Jiang, Zifan, Jang, Youngjoon, Momeni, Liliane, Varol, Gül, Ebling, Sarah, Zisserman, Andrew
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
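The alignment stage lends itself to a compact illustration. Below is a minimal sketch, assuming cosine similarity between per-sign and per-subtitle embeddings and a simple monotonic (DTW-style) dynamic program; the function name, the recurrence, and all shapes are illustrative assumptions, not SEA's released implementation.

```python
import numpy as np

def align_subtitles_to_signs(sign_embs: np.ndarray, sub_embs: np.ndarray):
    """Monotonically assign N sign embeddings to M subtitle embeddings.

    A DTW-style dynamic program over cosine similarities; a sketch of the
    kind of CPU-only alignment step described in the abstract, not SEA's code.
    """
    # Cosine similarity matrix between signs (rows) and subtitles (columns).
    a = sign_embs / np.linalg.norm(sign_embs, axis=1, keepdims=True)
    b = sub_embs / np.linalg.norm(sub_embs, axis=1, keepdims=True)
    sim = a @ b.T                      # shape (N, M)

    n, m = sim.shape
    dp = np.full((n + 1, m + 1), -np.inf)
    dp[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each sign either stays with the current subtitle or advances to the next.
            dp[i, j] = sim[i - 1, j - 1] + max(dp[i - 1, j], dp[i - 1, j - 1])

    # Backtrace to recover which subtitle each sign was assigned to.
    assignment, j = [], int(np.argmax(dp[n]))
    for i in range(n, 0, -1):
        assignment.append(j - 1)
        if dp[i - 1, j - 1] >= dp[i - 1, j]:
            j -= 1
    return assignment[::-1]            # subtitle index per sign
```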
- Europe > Spain (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)
Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
Hori, Chiori, Masuyama, Yoshiki, Jain, Siddarth, Corcodel, Radu, Jha, Devesh, Romeres, Diego, Le Roux, Jonathan
Human-robot collaboration towards a shared goal requires robots to understand human actions and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-Former that incorporates left and right context dependency across full videos. Furthermore, it proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder to mitigate the over-abstraction of textual information by the Q-Former. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-Former improves confirmation and action planning when integrated with VideoLLaMA3.
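As a rough illustration of the left/right clip context the abstract describes, the sketch below lets a fixed set of query tokens cross-attend over the current clip together with its neighbouring clips; the dimensions, single attention layer, and context window are made-up placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LongContextQFormerSketch(nn.Module):
    """Toy illustration of clip-level queries attending over neighbouring clips.

    A hedged sketch of the long-context idea (left/right clip dependency),
    not the authors' model; all hyperparameters are placeholders.
    """
    def __init__(self, dim=768, n_queries=32, context=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.context = context  # clips taken on each side of the current clip

    def forward(self, clip_feats):
        # clip_feats: (num_clips, tokens_per_clip, dim) visual features per clip
        outputs = []
        for c in range(clip_feats.size(0)):
            lo = max(0, c - self.context)
            hi = min(clip_feats.size(0), c + self.context + 1)
            # Flatten the current clip plus its left/right neighbours into one memory.
            memory = clip_feats[lo:hi].reshape(1, -1, clip_feats.size(-1))
            fused, _ = self.cross_attn(self.queries.unsqueeze(0), memory, memory)
            outputs.append(fused.squeeze(0))  # (n_queries, dim) prompt for the LLM decoder
        return torch.stack(outputs)
```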
- Law (1.00)
- Information Technology (1.00)
- Media > Film (0.67)
- Leisure & Entertainment (0.67)
V-SAT: Video Subtitle Annotation Tool
Kundu, Arpita, Chakraborty, Joyita, Desarkar, Anindita, Sen, Aritra, Patil, Srushti Anil, Raman, Vishwanathan
The surge of audiovisual content on streaming platforms and social media has heightened the demand for accurate and accessible subtitles. However, existing subtitle generation methods, primarily speech-based transcription or OCR-based extraction, suffer from several shortcomings, including poor synchronization, incorrect or harmful text, inconsistent formatting, inappropriate reading speeds, and the inability to adapt to dynamic audio-visual contexts. Current approaches often address isolated issues, leaving post-editing as a labor-intensive and time-consuming process. In this paper, we introduce V-SAT (Video Subtitle Annotation Tool), a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models (LLMs), Vision-Language Models (VLMs), image processing, and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from both audio and video. Subtitle quality improved substantially: the SUBER score dropped from 9.6 to 3.54 after resolving all language-mode issues, and F1-scores of ~0.80 were achieved for image-mode issues. Human-in-the-loop validation ensures high-quality results, providing the first comprehensive solution for robust subtitle annotation.
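To make concrete the kind of rule-based checks a tool like this could run before LLM/VLM correction, here is a minimal sketch over subtitle cues; the thresholds (reading speed in characters per second, minimum inter-cue gap) are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class SubtitleCue:
    start: float   # seconds
    end: float     # seconds
    text: str

def flag_quality_issues(cues, max_cps=17.0, min_gap=0.08):
    """Flag simple timing/readability problems in a list of subtitle cues.

    A hedged sketch of pre-filtering checks; thresholds are assumptions.
    """
    issues = []
    for i, cue in enumerate(cues):
        duration = cue.end - cue.start
        if duration <= 0:
            issues.append((i, "non-positive duration"))
            continue
        cps = len(cue.text.replace(" ", "")) / duration
        if cps > max_cps:
            issues.append((i, f"reading speed too high ({cps:.1f} cps)"))
        if i + 1 < len(cues) and cues[i + 1].start - cue.end < min_gap:
            issues.append((i, "overlaps or nearly touches the next cue"))
    return issues
```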
- Asia > India > Maharashtra > Pune (0.05)
- Asia > Middle East > Israel (0.04)
- Media (0.49)
- Leisure & Entertainment (0.48)
LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction
This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features via multi-modal projections. The model is adapted using Low-Rank Adaptation (LoRA). A heavily tuned ensemble of gradient-boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system exhibits greater robustness and better generalization on the final test set than the baseline. The paper's codebase can be found at https://github.com/dsgt-arc/mediaeval-2025-memorability
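A rough sketch of the fusion idea follows: pre-extracted ViT and E5 features are projected into the LLM's token space as soft prompts, and a scalar memorability score is regressed from the resulting hidden states. The feature dimensions, mean-pooled readout, and regression head are assumptions, and LoRA adaptation of the backbone is omitted for brevity.

```python
import torch
import torch.nn as nn

class MemorabilityFusionSketch(nn.Module):
    """Project precomputed visual/text features into an LLM's token space.

    A hedged sketch of the multimodal-projection idea in the abstract,
    not the submitted system; all sizes are placeholders.
    """
    def __init__(self, vit_dim=768, e5_dim=1024, llm_dim=2560):
        super().__init__()
        self.vis_proj = nn.Linear(vit_dim, llm_dim)   # visual -> LLM token space
        self.txt_proj = nn.Linear(e5_dim, llm_dim)    # textual -> LLM token space
        self.head = nn.Linear(llm_dim, 1)             # memorability regression head

    def forward(self, vit_feats, e5_feats, llm_hidden_fn):
        # vit_feats: (B, Nv, vit_dim), e5_feats: (B, Nt, e5_dim)
        soft_tokens = torch.cat([self.vis_proj(vit_feats),
                                 self.txt_proj(e5_feats)], dim=1)
        # llm_hidden_fn stands in for the (frozen or LoRA-adapted) LLM backbone.
        hidden = llm_hidden_fn(soft_tokens)           # (B, N, llm_dim)
        return self.head(hidden.mean(dim=1)).squeeze(-1)  # predicted score per sample
```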
- Europe > Slovenia > Central Slovenia > Municipality of Dobrepolje > Videm (0.05)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
Paper2Video: Automatic Video Generation from Scientific Papers
Zhu, Zeyu, Lin, Kevin Qinghong, Shou, Mike Zheng
Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2-to-10-minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics (Meta Similarity, PresentArena, PresentQuiz, and IP Memory) to measure how well videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement via a novel tree-search-based visual choice, together with cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
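The slide-wise parallelisation mentioned at the end of the abstract can be pictured with a small sketch; the per-slide callables below stand in for subtitling, speech synthesis, and talking-head rendering and are hypothetical placeholders, not the PaperTalker API.

```python
from concurrent.futures import ThreadPoolExecutor

def render_presentation(slides, make_subtitles, make_speech, make_talking_head):
    """Run per-slide generation steps in parallel, then assemble in order.

    A hedged sketch of slide-wise parallel generation; the three callables
    are hypothetical stand-ins for the per-slide pipeline stages.
    """
    def process(indexed_slide):
        idx, slide = indexed_slide
        subs = make_subtitles(slide)      # subtitle text for this slide
        audio = make_speech(subs)         # synthesized narration
        video = make_talking_head(audio)  # rendered talker segment
        return idx, {"slide": slide, "subtitles": subs, "audio": audio, "video": video}

    with ThreadPoolExecutor() as pool:
        results = dict(pool.map(process, enumerate(slides)))
    # Concatenate segments in slide order to form the final presentation video.
    return [results[i] for i in range(len(slides))]
```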
- Law (1.00)
- Media > Film (0.93)
- Leisure & Entertainment (0.93)
- Information Technology > Security & Privacy (0.67)