video description



Blink budget security cameras will support AI-powered video descriptions

Engadget

The feature mirrors a similar option on Ring cameras and doorbells. Amazon's budget Blink smart home brand is adding AI-generated video descriptions as a new benefit for subscribers. Blink Video Descriptions are text descriptions of the motion that Blink doorbells and cameras capture, and they'll be available in beta starting today, November 17. Like Ring Video Descriptions, the equivalent feature on Amazon's other smart home brand, Blink's AI-generated descriptions are meant to give you a concise way to check what's happening in and around your home. Any kind of motion can produce a video clip and a notification in the Blink app, and video descriptions should help you weed out which clips are worth watching and worrying about.


Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals

Baghel, Shruti Singh, Rathore, Yash Pratap Singh, Jena, Sushovan, Pradhan, Anurag, Shukla, Amit, Bhavsar, Arnav, Goyal, Pawan

arXiv.org Artificial Intelligence

Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions, but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor) and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework, which evaluates spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework, which focuses on mobility-critical information. Additionally, we conduct a systematic evaluation of four prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
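
A minimal sketch of the kind of FP32-vs-INT8 comparison described above, using PyTorch post-training dynamic quantization on a stand-in linear stack rather than the actual SmolVLM2 weights; torch.ao.quantization.quantize_dynamic is one common route to INT8 inference and is an assumption here, not the paper's stated toolchain.

```python
# FP32 vs. INT8 latency sketch. The model is a stand-in for a VLM's
# language head (Linear layers are what dynamic quantization targets),
# not the actual SmolVLM2 pipeline.
import time
import torch
import torch.nn as nn

model_fp32 = nn.Sequential(
    nn.Linear(768, 2048), nn.GELU(),
    nn.Linear(2048, 768),
).eval()

# Post-training dynamic quantization: weights stored as INT8,
# activations quantized on the fly at inference time.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

def bench(model, x, iters=100):
    with torch.inference_mode():
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - start) / iters

x = torch.randn(1, 768)
print(f"FP32: {bench(model_fp32, x) * 1e3:.2f} ms/iter")
print(f"INT8: {bench(model_int8, x) * 1e3:.2f} ms/iter")
```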


Finally! Ring cams will stop bombarding you with AI alerts

PCWorld

With its latest feature, Ring aims to combine multiple AI-powered event summaries into a single notification. Ring's AI event notifications are handy when it comes to getting text descriptions of what's happening around your abode, but too many of the AI-generated pop-ups can get annoying fast. To cut down on the chatter, Ring is debuting a new feature: AI Single Event Alert, which takes multiple AI notifications from related motion events captured by your Ring cameras and combines them into--you guessed it--a single alert. The feature, which is slated to begin rolling out today for subscribers to Ring's priciest subscription plan, joins a couple of other Ring AI tools that were first introduced last fall: Video Descriptions, which employ AI to write brief summaries of video events, and Smart Video Search, which allows you to comb through your saved videos using natural-language queries.
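
Ring hasn't published how related events are grouped, but the behavior described amounts to coalescing motion events that fall within a time window; a toy sketch under that assumption, with all names hypothetical:

```python
# Hypothetical sketch of folding related motion-event summaries into a
# single alert. Ring's actual grouping logic is not public; this only
# illustrates time-window coalescing per camera.
from dataclasses import dataclass

@dataclass
class MotionEvent:
    camera: str
    timestamp: float   # seconds since epoch
    summary: str       # AI-generated description of the clip

def coalesce(events: list[MotionEvent], window: float = 120.0) -> list[str]:
    """Merge events on the same camera that occur within `window` seconds."""
    alerts = []
    events = sorted(events, key=lambda e: (e.camera, e.timestamp))
    group: list[MotionEvent] = []
    for e in events:
        # Start a new alert when the camera changes or the gap is too long.
        if group and (e.camera != group[-1].camera
                      or e.timestamp - group[-1].timestamp > window):
            alerts.append("; ".join(ev.summary for ev in group))
            group = []
        group.append(e)
    if group:
        alerts.append("; ".join(ev.summary for ev in group))
    return alerts
```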



VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding

Waheed, Abdul, Wu, Zhen, Alharthi, Dareen, Kim, Seungone, Raj, Bhiksha

arXiv.org Artificial Intelligence

Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the nuances of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, a family of 3B- and 7B-parameter MLLM judges specialized to evaluate outputs from video understanding models (i.e., text responses conditioned on videos). To train VideoJudge, our recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses not matching the evaluator's rating are discarded. Across three of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B). Notably, we find that LLM judges (Qwen3) perform worse than MLLM judges (Qwen2.5-VL) and that long chain-of-thought reasoning does not improve performance, indicating that providing video inputs is crucial for evaluating video understanding tasks.
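
The training recipe described above is essentially rejection sampling against the judge: generate a response aimed at a target rating, keep it only if the evaluator assigns that same rating. A minimal sketch of that filtering loop, with generate and evaluate as hypothetical stand-ins for the paper's MLLM generator and judge:

```python
# Generator/evaluator bootstrapping sketch. Only the filtering logic
# reflects the abstract; the two stubs below are hypothetical stand-ins
# that make the loop runnable.
import random

def generate(video, question, target_rating):
    # Stand-in for the MLLM generator, which the paper prompts to produce
    # a response whose quality should match `target_rating` (e.g. 1-5).
    return f"[answer about {question} at quality {target_rating}]"

def evaluate(video, question, response):
    # Stand-in for the current judge; a real system would score the
    # response with an MLLM. This stub recovers the embedded rating.
    return int(response.rstrip("]").rsplit(" ", 1)[-1])

def bootstrap(examples, ratings=(1, 2, 3, 4, 5)):
    train_data = []
    for video, question in examples:
        target = random.choice(ratings)
        response = generate(video, question, target)
        # Keep only samples where the judge's rating matches the intended
        # one, so ratings in the training set are self-consistent.
        if evaluate(video, question, response) == target:
            train_data.append((video, question, response, target))
    return train_data

print(bootstrap([("clip.mp4", "What is the person doing?")]))
```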


Content and Engagement Trends in COVID-19 YouTube Videos: Evidence from the Late Pandemic

Thakur, Nirmalya, Hartel, Madeline D, Boden, Lane Michael, Enriquez, Dallas, Ricks, Boston Joyner

arXiv.org Artificial Intelligence

This work analyzed about 10,000 COVID-19-related YouTube videos published between January 2023 and October 2024 to evaluate how temporal, lexical, linguistic, and structural factors influenced engagement during the late pandemic period. Publishing activity showed consistent weekday effects: in the first window, average views peaked on Mondays at 92,658; in the second, on Wednesdays at 115,479; and in the third, on Fridays at 84,874, reflecting a shift in audience attention toward mid- and late week. Lexical analysis of video titles revealed recurring high-frequency keywords related to COVID-19 and YouTube features, including COVID, coronavirus, shorts, and live. Frequency analysis revealed sharp spikes, with COVID appearing in 799 video titles in August 2024, while engagement analysis showed that videos with shorts in the title attracted very high view counts, peaking at 2.16 million average views per video in June 2023. Sentiment analysis of English-language video descriptions showed weak correlation with views in the raw data (Pearson r = 0.0154, p = 0.2987), but stronger correlations emerged once outliers were addressed, with Spearman r = 0.110 (p < 0.001) and Pearson r = 0.0925 (p < 0.001). Category-level analysis of video durations revealed contrasting outcomes: long videos focusing on people and blogs averaged 209,114 views, short entertainment videos averaged 288,675 views, and medium-to-long news and politics videos averaged 51,309 and 59,226 views, respectively. These results demonstrate that engagement patterns of COVID-19-related videos on YouTube during the late pandemic followed distinct characteristics driven by publishing schedules, title vocabulary, topics, and genre-specific duration effects.
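
The raw-versus-trimmed correlation contrast reported above is easy to reproduce in outline with scipy; the sketch below uses synthetic data, and the 99th-percentile trimming rule is an assumption for illustration, not the paper's stated outlier treatment.

```python
# Pearson vs. Spearman on heavy-tailed view counts, before and after
# trimming outliers. Data here is synthetic, not the study's dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sentiment = rng.uniform(-1, 1, 5000)                 # per-video description sentiment
views = rng.lognormal(mean=10, sigma=2, size=5000)   # heavy-tailed, like view counts

# Raw Pearson correlation: a few viral outliers dominate.
print(stats.pearsonr(sentiment, views))

# Address outliers (here: drop views above the 99th percentile), then
# compare rank-based Spearman with Pearson, as the study does.
mask = views < np.percentile(views, 99)
print(stats.spearmanr(sentiment[mask], views[mask]))
print(stats.pearsonr(sentiment[mask], views[mask]))
```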


From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach

Masala, Mihai, Leordeanu, Marius

arXiv.org Artificial Intelligence

The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond descriptions that merely enumerate simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks to produce the final natural language description. Moreover, we also demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich, and relevant textual descriptions on videos collected from multiple varied datasets, using standard evaluation metrics, human annotations, and consensus from ensembles of state-of-the-art VLMs.
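
As a rough illustration of what a graph of events in space and time might look like as an explicit, inspectable intermediate between vision and language, here is a toy networkx sketch; the node and edge schema is an assumption for illustration, not the paper's exact formulation.

```python
# Toy event graph: nodes are events with rough temporal and spatial
# extent, edges are relations between events. Schema is illustrative.
import networkx as nx

g = nx.DiGraph()
# Nodes: events with a time interval (seconds) and an image region (x, y, w, h).
g.add_node("pick_up_cup", actor="person", t=(2.0, 3.5), region=(120, 80, 60, 60))
g.add_node("drink", actor="person", t=(3.5, 6.0), region=(130, 60, 50, 50))
# Edges: relations between events (temporal order, shared participants).
g.add_edge("pick_up_cup", "drink", relation="before", shared="cup")

# A description can then be generated by walking the graph in temporal
# order instead of decoding directly from pixels.
for u, v, d in g.edges(data=True):
    print(f"{u} {d['relation']} {v} (shared: {d['shared']})")
```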


Ring harnesses generative AI to power Ring Video Descriptions

PCWorld

Ring is bringing generative AI to its family of home security cameras and video doorbells with a new feature called Video Descriptions. Once this feature is enabled, the motion alerts triggered by Ring cameras will be accompanied by an AI-generated analysis of the motion that triggered the camera to record. In a blog post earlier today, Ring founder Jamie Siminoff described how the push notifications Ring users receive on their smartphones when motion is detected will be enhanced with text descriptions of what that motion was. "This new generative AI feature," Siminoff said, "helps you quickly distinguish between urgent and everyday activity with a quick glance at your phone." Ring will use generative AI to deliver descriptions of the events its security cameras and video doorbells capture on video.


Proactive Assistant Dialogue Generation from Streaming Egocentric Videos

Zhang, Yichi, Dong, Xin Luna, Lin, Zhaojiang, Madotto, Andrea, Kumar, Anuj, Damavandi, Babak, Chai, Joyce, Moon, Seungwhan

arXiv.org Artificial Intelligence

Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. Project page: https://pro-assist.github.io/
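
A minimal sketch of the streaming control flow the abstract describes: consume frames incrementally and decide at each step whether to stay silent or speak. All model calls below are hypothetical stand-ins; only the silence-versus-respond loop (the source of the data imbalance the authors mention) is illustrated.

```python
# Streaming assistant loop sketch: most frames warrant no response,
# so the policy must mostly choose silence. Model calls are stubs.
from typing import Iterator, Optional

def perceive(frame) -> dict:
    # Stand-in for the visual encoder over the latest frame; here we
    # pretend a task step completes every 30 frames.
    return {"step_done": frame % 30 == 0}

def policy(state: dict, history: list[str]) -> Optional[str]:
    # Stand-in for the dialogue model: speak only when there is
    # something useful to say, keeping the assistant proactive
    # without flooding the user.
    if state["step_done"]:
        return f"Nice, step {len(history) + 1} looks done. Next: ..."
    return None

def run(frames: Iterator[int]):
    history: list[str] = []
    for frame in frames:
        utterance = policy(perceive(frame), history)
        if utterance is not None:
            history.append(utterance)
            print(utterance)

run(iter(range(120)))
```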