video description



Blink budget security cameras will support AI-powered video descriptions

Engadget

The feature mirrors a similar option on Ring cameras and doorbells. Amazon's budget Blink smart home brand is adding AI-generated video descriptions as a new benefit for subscribers. Blink Video Descriptions are text descriptions of the motion that Blink doorbells and cameras capture, and they'll be available in beta starting today, November 17. Like Ring Video Descriptions, the equivalent feature on Amazon's other smart home brand, Blink's AI-generated descriptions are meant to give you a concise way to check what's happening in and around your home. Any kind of motion can produce a video clip and a notification in the Blink app, and video descriptions should help you weed out which clips are worth watching and worrying about.


Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals

Baghel, Shruti Singh, Rathore, Yash Pratap Singh, Jena, Sushovan, Pradhan, Anurag, Shukla, Amit, Bhavsar, Arnav, Goyal, Pawan

arXiv.org Artificial Intelligence

Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions, but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor) and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework, which evaluates spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework, which focuses on mobility-critical information. Additionally, we conduct a systematic evaluation of four prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
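
A minimal sketch of the kind of FP32-vs-INT8 comparison described above, using PyTorch post-training dynamic quantization on a stand-in linear stack rather than the actual SmolVLM2 weights; torch.ao.quantization.quantize_dynamic is one common route to INT8 inference and is an assumption here, not the paper's stated toolchain.

```python
# FP32 vs. INT8 latency sketch. The model is a stand-in for a VLM's
# language head (Linear layers are what dynamic quantization targets),
# not the actual SmolVLM2 pipeline.
import time
import torch
import torch.nn as nn

model_fp32 = nn.Sequential(
    nn.Linear(768, 2048), nn.GELU(),
    nn.Linear(2048, 768),
).eval()

# Post-training dynamic quantization: weights stored as INT8,
# activations quantized on the fly at inference time.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

def bench(model, x, iters=100):
    with torch.inference_mode():
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - start) / iters

x = torch.randn(1, 768)
print(f"FP32: {bench(model_fp32, x) * 1e3:.2f} ms/iter")
print(f"INT8: {bench(model_int8, x) * 1e3:.2f} ms/iter")
```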


Finally! Ring cams will stop bombarding you with AI alerts

PCWorld

With its latest feature, Ring aims to combine multiple AI-powered event summaries into a single notification. Ring's AI event notifications are handy when it comes to getting text descriptions of what's happening around your abode, but too many of the AI-generated pop-ups can get annoying fast. To cut down on the chatter, Ring is debuting a new feature: AI Single Event Alert, which takes multiple AI notifications from related motion events captured by your Ring cameras and combines them into--you guessed it--a single alert. The feature, which is slated to begin rolling out today for subscribers to Ring's priciest subscription plan, joins a couple of other Ring AI tools that were first introduced last fall: Video Descriptions, which employ AI to write brief summaries of video events, and Smart Video Search, which allows you to comb through your saved videos using natural-language queries.
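
Ring hasn't published how related events are grouped, but the behavior described amounts to coalescing motion events that fall within a time window; a toy sketch under that assumption, with all names hypothetical:

```python
# Hypothetical sketch of folding related motion-event summaries into a
# single alert. Ring's actual grouping logic is not public; this only
# illustrates time-window coalescing per camera.
from dataclasses import dataclass

@dataclass
class MotionEvent:
    camera: str
    timestamp: float   # seconds since epoch
    summary: str       # AI-generated description of the clip

def coalesce(events: list[MotionEvent], window: float = 120.0) -> list[str]:
    """Merge events on the same camera that occur within `window` seconds."""
    alerts = []
    events = sorted(events, key=lambda e: (e.camera, e.timestamp))
    group: list[MotionEvent] = []
    for e in events:
        # Start a new alert when the camera changes or the gap is too long.
        if group and (e.camera != group[-1].camera
                      or e.timestamp - group[-1].timestamp > window):
            alerts.append("; ".join(ev.summary for ev in group))
            group = []
        group.append(e)
    if group:
        alerts.append("; ".join(ev.summary for ev in group))
    return alerts
```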



VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding

Waheed, Abdul, Wu, Zhen, Alharthi, Dareen, Kim, Seungone, Raj, Bhiksha

arXiv.org Artificial Intelligence

Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the nuances of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, a family of 3B- and 7B-parameter MLLM judges specialized to evaluate outputs from video understanding models (i.e., text responses conditioned on videos). To train VideoJudge, our recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses not matching the evaluator's rating are discarded. Across three of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B). Notably, we find that LLM judges (Qwen3) perform worse than MLLM judges (Qwen2.5-VL) and that long chain-of-thought reasoning does not improve performance, indicating that providing video inputs is crucial for evaluating video understanding tasks.
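
The training recipe described above is essentially rejection sampling against the judge: generate a response aimed at a target rating, keep it only if the evaluator assigns that same rating. A minimal sketch of that filtering loop, with generate and evaluate as hypothetical stand-ins for the paper's MLLM generator and judge:

```python
# Generator/evaluator bootstrapping sketch. Only the filtering logic
# reflects the abstract; the two stubs below are hypothetical stand-ins
# that make the loop runnable.
import random

def generate(video, question, target_rating):
    # Stand-in for the MLLM generator, which the paper prompts to produce
    # a response whose quality should match `target_rating` (e.g. 1-5).
    return f"[answer about {question} at quality {target_rating}]"

def evaluate(video, question, response):
    # Stand-in for the current judge; a real system would score the
    # response with an MLLM. This stub recovers the embedded rating.
    return int(response.rstrip("]").rsplit(" ", 1)[-1])

def bootstrap(examples, ratings=(1, 2, 3, 4, 5)):
    train_data = []
    for video, question in examples:
        target = random.choice(ratings)
        response = generate(video, question, target)
        # Keep only samples where the judge's rating matches the intended
        # one, so ratings in the training set are self-consistent.
        if evaluate(video, question, response) == target:
            train_data.append((video, question, response, target))
    return train_data

print(bootstrap([("clip.mp4", "What is the person doing?")]))
```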


Content and Engagement Trends in COVID-19 YouTube Videos: Evidence from the Late Pandemic

Thakur, Nirmalya, Hartel, Madeline D, Boden, Lane Michael, Enriquez, Dallas, Ricks, Boston Joyner

arXiv.org Artificial Intelligence

This work analyzed about 10,000 COVID-19-related YouTube videos published between January 2023 and October 2024 to evaluate how temporal, lexical, linguistic, and structural factors influenced engagement during the late pandemic period. Publishing activity showed consistent weekday effects: in the first window, average views peaked on Mondays at 92,658; in the second, on Wednesdays at 115,479; and in the third, on Fridays at 84,874, reflecting a shift in audience attention toward mid- and late week. Lexical analysis of video titles revealed recurring high-frequency keywords related to COVID-19 and YouTube features, including COVID, coronavirus, shorts, and live. Frequency analysis revealed sharp spikes, with COVID appearing in 799 video titles in August 2024, while engagement analysis showed that videos with shorts in the title attracted very high view counts, peaking at 2.16 million average views per video in June 2023. Sentiment analysis of English-language video descriptions showed weak correlation with views in the raw data (Pearson r = 0.0154, p = 0.2987), but stronger correlations emerged once outliers were addressed, with Spearman r = 0.110 (p < 0.001) and Pearson r = 0.0925 (p < 0.001). Category-level analysis of video durations revealed contrasting outcomes: long videos focusing on people and blogs averaged 209,114 views, short entertainment videos averaged 288,675 views, and medium-to-long news and politics videos averaged 51,309 and 59,226 views, respectively. These results demonstrate that engagement patterns of COVID-19-related videos on YouTube during the late pandemic followed distinct characteristics driven by publishing schedules, title vocabulary, topics, and genre-specific duration effects.
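
The raw-versus-trimmed correlation contrast reported above is easy to reproduce in outline with scipy; the sketch below uses synthetic data, and the 99th-percentile trimming rule is an assumption for illustration, not the paper's stated outlier treatment.

```python
# Pearson vs. Spearman on heavy-tailed view counts, before and after
# trimming outliers. Data here is synthetic, not the study's dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sentiment = rng.uniform(-1, 1, 5000)                 # per-video description sentiment
views = rng.lognormal(mean=10, sigma=2, size=5000)   # heavy-tailed, like view counts

# Raw Pearson correlation: a few viral outliers dominate.
print(stats.pearsonr(sentiment, views))

# Address outliers (here: drop views above the 99th percentile), then
# compare rank-based Spearman with Pearson, as the study does.
mask = views < np.percentile(views, 99)
print(stats.spearmanr(sentiment[mask], views[mask]))
print(stats.pearsonr(sentiment[mask], views[mask]))
```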


From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach

Masala, Mihai, Leordeanu, Marius

arXiv.org Artificial Intelligence

The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond descriptions that merely enumerate simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks to produce the final natural language description. Moreover, we also demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich, and relevant textual descriptions on videos collected from multiple varied datasets, using standard evaluation metrics, human annotations, and consensus from ensembles of state-of-the-art VLMs.
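
As a rough illustration of what a graph of events in space and time might look like as an explicit, inspectable intermediate between vision and language, here is a toy networkx sketch; the node and edge schema is an assumption for illustration, not the paper's exact formulation.

```python
# Toy event graph: nodes are events with rough temporal and spatial
# extent, edges are relations between events. Schema is illustrative.
import networkx as nx

g = nx.DiGraph()
# Nodes: events with a time interval (seconds) and an image region (x, y, w, h).
g.add_node("pick_up_cup", actor="person", t=(2.0, 3.5), region=(120, 80, 60, 60))
g.add_node("drink", actor="person", t=(3.5, 6.0), region=(130, 60, 50, 50))
# Edges: relations between events (temporal order, shared participants).
g.add_edge("pick_up_cup", "drink", relation="before", shared="cup")

# A description can then be generated by walking the graph in temporal
# order instead of decoding directly from pixels.
for u, v, d in g.edges(data=True):
    print(f"{u} {d['relation']} {v} (shared: {d['shared']})")
```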


Ring harnesses generative AI to power Ring Video Descriptions

PCWorld

Ring is bringing generative AI to its family of home security cameras and video doorbells with a new feature called Video Descriptions. Once this feature is enabled, the motion alerts triggered by Ring cameras will be accompanied by an AI-generated analysis of the motion that triggered the camera to record. In a blog post earlier today, Ring founder Jamie Siminoff described how the push notifications Ring users receive on their smartphones when motion is detected will be enhanced with text descriptions of what that motion was. "This new generative AI feature," Siminoff said, "helps you quickly distinguish between urgent and everyday activity with a quick glance at your phone." Ring will use generative AI to deliver descriptions of the events its security cameras and video doorbells capture on video.


Proactive Assistant Dialogue Generation from Streaming Egocentric Videos

Zhang, Yichi, Dong, Xin Luna, Lin, Zhaojiang, Madotto, Andrea, Kumar, Anuj, Damavandi, Babak, Chai, Joyce, Moon, Seungwhan

arXiv.org Artificial Intelligence

Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks. Project page: https://pro-assist.github.io/
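
A minimal sketch of the streaming control flow the abstract describes: consume frames incrementally and decide at each step whether to stay silent or speak. All model calls below are hypothetical stand-ins; only the silence-versus-respond loop (the source of the data imbalance the authors mention) is illustrated.

```python
# Streaming assistant loop sketch: most frames warrant no response,
# so the policy must mostly choose silence. Model calls are stubs.
from typing import Iterator, Optional

def perceive(frame) -> dict:
    # Stand-in for the visual encoder over the latest frame; here we
    # pretend a task step completes every 30 frames.
    return {"step_done": frame % 30 == 0}

def policy(state: dict, history: list[str]) -> Optional[str]:
    # Stand-in for the dialogue model: speak only when there is
    # something useful to say, keeping the assistant proactive
    # without flooding the user.
    if state["step_done"]:
        return f"Nice, step {len(history) + 1} looks done. Next: ..."
    return None

def run(frames: Iterator[int]):
    history: list[str] = []
    for frame in frames:
        utterance = policy(perceive(frame), history)
        if utterance is not None:
            history.append(utterance)
            print(utterance)

run(iter(range(120)))
```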