video stream
AQUILA: A QUIC-Based Link Architecture for Resilient Long-Range UAV Communication
The proliferation of autonomous Unmanned Aerial Vehicles (UAVs) in Beyond Visual Line of Sight (BVLOS) applications is critically dependent on resilient, high-bandwidth, and low-latency communication links. Existing solutions face critical limitations: TCP's head-of-line blocking stalls time-sensitive data, UDP lacks reliability and congestion control, and cellular networks designed for terrestrial users degrade severely for aerial platforms. This paper introduces AQUILA, a cross-layer communication architecture built on QUIC to address these challenges. AQUILA contributes three key innovations: (1) a unified transport layer using QUIC's reliable streams for MAVLink Command and Control (C2) and unreliable datagrams for video, eliminating head-of-line blocking under unified congestion control; (2) a priority scheduling mechanism that structurally ensures C2 latency remains bounded and independent of video traffic intensity; (3) a UAV-adapted congestion control algorithm extending SCReAM with altitude-adaptive delay targeting and telemetry headroom reservation. AQUILA further implements 0-RTT connection resumption to minimize handover blackouts with application-layer replay protection, deployed over an IP-native architecture enabling global operation. Experimental validation demonstrates that AQUILA significantly outperforms TCP- and UDP-based approaches in C2 latency, video quality, and link resilience under realistic conditions, providing a robust foundation for autonomous BVLOS missions.
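The abstract describes strict prioritization of C2 traffic over video under a single congestion-control budget. Below is a minimal, hedged sketch of that scheduling idea in plain Python: C2 frames (which AQUILA carries on a reliable QUIC stream) are always drained before video datagrams when filling one send budget, so C2 queueing delay stays independent of video backlog. The class and parameter names (`PriorityScheduler`, `budget_bytes`) are illustrative, not AQUILA's actual implementation, which operates on QUIC streams and DATAGRAM frames.

```python
import collections
from dataclasses import dataclass

@dataclass
class Frame:
    kind: str      # "c2" (reliable stream) or "video" (unreliable datagram)
    size: int      # bytes
    payload: bytes

class PriorityScheduler:
    """Strict-priority scheduler: C2 frames are always sent before video
    datagrams, so C2 queueing delay does not grow with video load."""
    def __init__(self):
        self.c2_queue = collections.deque()
        self.video_queue = collections.deque()

    def enqueue(self, frame: Frame):
        (self.c2_queue if frame.kind == "c2" else self.video_queue).append(frame)

    def dequeue_burst(self, budget_bytes: int):
        """Fill one congestion-window budget, draining C2 first."""
        burst = []
        for queue in (self.c2_queue, self.video_queue):
            while queue and queue[0].size <= budget_bytes:
                frame = queue.popleft()
                budget_bytes -= frame.size
                burst.append(frame)
        return burst

# Usage: a heavy video backlog never delays the C2 frame.
sched = PriorityScheduler()
for _ in range(50):
    sched.enqueue(Frame("video", 1200, b"nal-unit"))
sched.enqueue(Frame("c2", 64, b"mavlink-heartbeat"))
burst = sched.dequeue_burst(budget_bytes=6000)
print([f.kind for f in burst])   # ['c2', 'video', 'video', ...]
```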
- North America > United States > New York > New York County > New York City (0.04)
- Asia > China > Jiangxi Province > Nanchang (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Information Technology > Security & Privacy (1.00)
- Telecommunications > Networks (0.69)
From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
Zhang, Jiajie, Schwertfeger, Sören, Kleiner, Alexander
We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.
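The abstract does not define "Latent Action Energy" precisely; one plausible reading is an energy computed from the magnitude of change in consecutive latent action tokens, with segment boundaries placed at low-energy pauses. The sketch below implements that assumed form with NumPy; the metric, smoothing window, and threshold are all illustrative assumptions rather than the paper's actual formulation.

```python
import numpy as np

def latent_action_energy(latents: np.ndarray) -> np.ndarray:
    """Per-step 'energy' as the norm of consecutive latent-action deltas.
    (Assumed form; the paper's exact metric may differ.)"""
    deltas = np.diff(latents, axis=0)
    return np.linalg.norm(deltas, axis=1)

def segment_primitives(latents, smooth=5, rel_thresh=0.3):
    """Split a latent-action sequence at low-energy valleys, treated here
    as pauses between semantically coherent action primitives."""
    energy = latent_action_energy(latents)
    energy = np.convolve(energy, np.ones(smooth) / smooth, mode="same")
    low = energy < rel_thresh * energy.max()
    segments, start = [], 0
    for t in range(1, len(low)):
        if low[t] and not low[t - 1]:          # entering a pause: close segment
            if t - start > smooth:
                segments.append((start, t))
        if not low[t] and low[t - 1]:          # leaving a pause: open new segment
            start = t
    if len(low) - start > smooth:
        segments.append((start, len(low)))
    return segments

# Toy example: two bursts of random latent actions separated by a near-still pause.
rng = np.random.default_rng(0)
latents = np.concatenate([rng.normal(0, 1.0, (40, 16)),
                          np.zeros((20, 16)),
                          rng.normal(0, 1.0, (40, 16))])
print(segment_primitives(latents))   # roughly [(0, ~40), (~59, 99)]
```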
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding
Yang, Zhenyu, Zhang, Kairui, Hu, Yuhang, Wang, Bing, Qian, Shengsheng, Wen, Bin, Yang, Fan, Gao, Tingting, Dong, Weiming, Xu, Changsheng
Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via a single forward pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with streaming key-value cache to achieve 1.53x faster inference. We also construct OmniStar, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar's state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with 18.1% reduced timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.
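The two mechanisms named in the abstract, response-silence decoding and peak-end memory compression, can be illustrated with a toy streaming loop. In the sketch below, a single per-frame score stands in for the model's one forward-pass verification, and memory compression keeps the highest-salience ("peak") frames plus the most recent ("end") frames. All names and thresholds are illustrative assumptions, not LiveStar's actual interfaces.

```python
def peak_end_compress(memory, keep_peak=4, keep_end=4):
    """Peak-end style compression: retain the highest-salience frames ('peaks')
    plus the most recent frames ('end'); drop the rest. (Assumed reading of
    the paper's peak-end memory compression.)"""
    end = memory[-keep_end:]
    earlier = memory[:-keep_end]
    peaks = sorted(earlier, key=lambda f: f["salience"], reverse=True)[:keep_peak]
    return sorted(peaks + end, key=lambda f: f["t"])

def streaming_assistant(frames, respond_threshold=0.7, max_memory=32):
    """Response-silence loop: one verification score per incoming frame
    decides whether to emit a response now or stay silent."""
    memory = []
    for frame in frames:
        memory.append(frame)
        if len(memory) > max_memory:
            memory = peak_end_compress(memory)
        # Stand-in for the model's single forward-pass verification score.
        score = frame["salience"]
        if score >= respond_threshold:
            yield frame["t"], f"response grounded in {len(memory)} memory frames"

# Toy stream: salience spikes mark moments worth narrating.
frames = [{"t": t, "salience": 0.95 if t % 25 == 0 else 0.1} for t in range(100)]
for t, msg in streaming_assistant(frames):
    print(t, msg)
```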
- Leisure & Entertainment (0.67)
- Media (0.46)
- Information Technology (0.46)
AVA: Towards Agentic Video Analytics with Vision Language Models
Yan, Yuxuan, Jiang, Shiqi, Cao, Ting, Yang, Yifan, Yang, Qianqian, Shu, Yuanchao, Yang, Yuqing, Qiu, Lili
AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
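The EKG-plus-retrieval idea in the abstract can be sketched with a small event graph: nodes carry time spans, captions, and entities; edges encode temporal order and shared entities; a query retrieves matching events and their graph neighbours as context for answer generation. This is a simplified stand-in under stated assumptions, not AVA's actual indexing or agentic pipeline; the event records and keyword matching are purely illustrative.

```python
import networkx as nx

def build_ekg(events):
    """Index a long video as an Event Knowledge Graph: nodes are events with
    time spans, captions, and entities; edges encode temporal order and
    entity co-occurrence. (Simplified stand-in for AVA's EKG construction.)"""
    g = nx.DiGraph()
    for i, ev in enumerate(events):
        g.add_node(i, **ev)
        if i > 0:
            g.add_edge(i - 1, i, relation="next")           # temporal order
        for j in range(i):
            if set(events[j]["entities"]) & set(ev["entities"]):
                g.add_edge(j, i, relation="shared_entity")   # entity co-occurrence
    return g

def retrieve(g, query):
    """Keyword-match events, then expand to graph neighbours so the answer
    generator sees surrounding context (a crude stand-in for agentic retrieval)."""
    terms = set(query.lower().split())
    hits = {n for n, d in g.nodes(data=True)
            if terms & set(d["caption"].lower().split())}
    context = set(hits)
    for n in hits:
        context |= set(g.predecessors(n)) | set(g.successors(n))
    return [g.nodes[n] for n in sorted(context)]

events = [
    {"span": (0, 40),   "caption": "a truck parks at the gate", "entities": ["truck", "gate"]},
    {"span": (40, 95),  "caption": "workers unload boxes",      "entities": ["workers", "boxes"]},
    {"span": (95, 130), "caption": "the truck leaves the gate", "entities": ["truck", "gate"]},
]
g = build_ekg(events)
for ev in retrieve(g, "when does the truck leave"):
    print(ev["span"], ev["caption"])
```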
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
AHA -- Predicting What Matters Next: Online Highlight Detection Without Looking Ahead
Chang, Aiden, De Melo, Celso, Lukin, Stephanie M.
Real-time understanding of continuous video streams is essential for intelligent agents operating in high-stakes environments, including autonomous vehicles, surveillance drones, and disaster response robots. Yet, most existing video understanding and highlight detection methods assume access to the entire video during inference, making them unsuitable for online or streaming scenarios. In particular, current models optimize for offline summarization, failing to support step-by-step reasoning needed for real-time decision-making. We introduce Aha, an autoregressive highlight detection framework that predicts the relevance of each video frame against a task described in natural language. Without accessing future video frames, Aha utilizes a multimodal vision-language model and lightweight, decoupled heads trained on a large, curated dataset of human-centric video labels. To enable scalability, we introduce the Dynamic SinkCache mechanism that achieves constant memory usage across infinite-length streams without degrading performance on standard benchmarks. This encourages the hidden representation to capture high-level task objectives, enabling effective frame-level rankings for informativeness, relevance, and uncertainty with respect to the natural language task. Aha achieves state-of-the-art (SOTA) performance on highlight detection benchmarks, surpassing even prior offline, full-context approaches and video-language models by +5.9% on TVSum and +8.3% on Mr. Hisum in mAP (mean Average Precision). We explore Aha's potential for real-world robotics applications given a task-oriented natural language input and a continuous, robot-centric video. Both experiments demonstrate Aha's potential effectiveness as a real-time reasoning module for downstream planning and long-horizon understanding.
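The Dynamic SinkCache described above is presented as achieving constant memory over infinite streams. A minimal sketch of one common way to get that property is shown below: keep a fixed set of early "sink" entries plus a rolling window of recent entries, so the cache never grows. This is an assumed simplification of Aha's mechanism, and the class and parameter names are illustrative.

```python
from collections import deque

class SinkCache:
    """Bounded KV-style cache: permanently keep the first `n_sink` entries
    ('attention sinks') plus a rolling window of the most recent entries, so
    memory stays constant over arbitrarily long streams. (Simplified reading
    of Aha's Dynamic SinkCache.)"""
    def __init__(self, n_sink=4, window=256):
        self.n_sink = n_sink
        self.sinks = []
        self.window = deque(maxlen=window)

    def append(self, kv):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)
        else:
            self.window.append(kv)     # deque drops the oldest entry automatically

    def view(self):
        """Entries visible to the next decoding step."""
        return self.sinks + list(self.window)

cache = SinkCache(n_sink=4, window=8)
for t in range(1000):                  # simulate an unbounded frame stream
    cache.append({"t": t})
print(len(cache.view()), [e["t"] for e in cache.view()])   # 12 entries: sinks + last 8
```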
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > Maryland > Prince George's County > Adelphi (0.04)
- (3 more...)
- Information Technology (0.67)
- Health & Medicine (0.46)
Autonomous UAV Navigation for Search and Rescue Missions Using Computer Vision and Convolutional Neural Networks
Šiktar, Luka, Ćaran, Branimir, Šekoranja, Bojan, Švaco, Marko
In this paper, we present a subsystem for search and rescue missions using Unmanned Aerial Vehicles (UAVs), focusing on people detection, face recognition, and tracking of identified individuals. The proposed solution integrates a UAV with the ROS2 framework and utilizes multiple convolutional neural networks (CNNs) for search missions. System identification and PD controller deployment are performed for autonomous UAV navigation. The ROS2 environment uses the YOLOv11 and YOLOv11-pose CNNs for tracking and the dlib library CNN for face recognition. The system detects a specific individual, performs face recognition, and starts tracking. If the individual is not yet known, the UAV operator can manually locate the person, save their facial image, and immediately initiate the tracking process. Tracking relies on specific keypoints identified on the human body using the YOLOv11-pose CNN model; these keypoints are used to follow a specific individual and maintain a safe distance. To improve tracking accuracy, system identification is performed based on measurement data from the UAV's IMU. The identified system parameters are used to design PD controllers that use YOLOv11-pose to estimate the distance between the UAV's camera and the identified individual. Initial experiments conducted on 14 known individuals demonstrated that the proposed subsystem can be used successfully in real time. The next step involves implementing the system on a large experimental UAV for field use and integrating autonomous navigation with GPS-guided control for rescue-operation planning.
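The distance-keeping loop in the abstract (keypoint-based distance estimate feeding a PD controller) can be sketched as follows. The pinhole calibration constants, keypoint choice (shoulder gap), and PD gains below are illustrative placeholders, not the paper's identified system parameters.

```python
class PDController:
    def __init__(self, kp, kd, dt):
        self.kp, self.kd, self.dt = kp, kd, dt
        self.prev_error = 0.0

    def step(self, error):
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.kd * derivative

def distance_from_keypoints(shoulder_px_gap, focal_px=900.0, shoulder_width_m=0.4):
    """Pinhole approximation: distance ~ focal_px * real_width / pixel_width.
    focal_px and shoulder_width_m are illustrative calibration values."""
    return focal_px * shoulder_width_m / max(shoulder_px_gap, 1e-6)

controller = PDController(kp=0.8, kd=0.2, dt=0.05)
target_distance_m = 3.0
for shoulder_px_gap in [200, 180, 160, 150, 130, 121, 120]:   # person walks away
    distance = distance_from_keypoints(shoulder_px_gap)
    error = distance - target_distance_m          # positive -> move forward
    forward_velocity = controller.step(error)
    print(f"d={distance:.2f} m  v_cmd={forward_velocity:+.2f} m/s")
```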
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- Europe > Croatia > Zagreb County > Zagreb (0.04)
- Asia > China > Chongqing Province > Chongqing (0.04)
- Information Technology (0.50)
- Health & Medicine (0.34)
Infinite Video Understanding
Zhang, Dell, Chen, Xiangyu, Luo, Jixiang, Jia, Mengxi, Sun, Changzhi, Ren, Ruilong, Liu, Jingren, Sun, Hao, Li, Xuelong
The rapid advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have ushered in remarkable progress in video understanding. However, a fundamental challenge persists: effectively processing and comprehending video content that extends beyond minutes or hours. While recent efforts like Video-XL-2 have demonstrated novel architectural solutions for extreme efficiency, and advancements in positional encoding such as HoPE and VideoRoPE++ aim to improve spatio-temporal understanding over extensive contexts, current state-of-the-art models still encounter significant computational and memory constraints when faced with the sheer volume of visual tokens from lengthy sequences. Furthermore, maintaining temporal coherence, tracking complex events, and preserving fine-grained details over extended periods remain formidable hurdles, despite progress in agentic reasoning systems like Deep Video Discovery. This position paper posits that a logical, albeit ambitious, next frontier for multimedia research is Infinite Video Understanding -- the capability for models to continuously process, understand, and reason about video data of arbitrary, potentially never-ending duration. We argue that framing Infinite Video Understanding as a blue-sky research objective provides a vital north star for the multimedia, and the wider AI, research communities, driving innovation in areas such as streaming architectures, persistent memory mechanisms, hierarchical and adaptive representations, event-centric reasoning, and novel evaluation paradigms. Drawing inspiration from recent work on long/ultra-long video understanding and several closely related fields, we outline the core challenges and key research directions towards achieving this transformative capability.
- Asia > China > Shanghai > Shanghai (0.05)
- Asia > China > Tianjin Province > Tianjin (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Vision > Video Understanding (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Towards Perception-based Collision Avoidance for UAVs when Guiding the Visually Impaired
Raj, Suman, Padhi, Swapnil, Bhoot, Ruchi, Modi, Prince, Simmhan, Yogesh
Autonomous navigation by drones using onboard sensors combined with machine learning and computer vision algorithms is impacting a number of domains, including agriculture, logistics, and disaster management. In this paper, we examine the use of drones for assisting visually impaired people (VIPs) in navigating outdoor urban environments. Specifically, we present a perception-based path planning system for local planning around the neighborhood of the VIP, integrated with a global planner based on GPS and maps for coarse planning. We represent the problem using a geometric formulation and propose a multi-DNN framework for obstacle avoidance for both the UAV and the VIP. Our evaluations, conducted on a drone-human system in a university campus environment, verify the feasibility of our algorithms in three scenarios: when the VIP walks on a footpath, near parked vehicles, and in a crowded street.
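The abstract does not detail its geometric formulation; one common geometric check for this kind of local planning is sketched below: flag obstacles whose distance to the segment between the VIP and the next global waypoint falls below a safety radius, and shift the local waypoint to the clear side. This is an illustrative assumption about the formulation, not the paper's actual planner; all thresholds are placeholders.

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Shortest distance from point p to segment a-b (2-D ground plane)."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab)), t

def local_plan(vip_pos, goal, obstacles, safety_radius=1.5, sidestep=2.0):
    """If a detected obstacle intrudes into the corridor between the VIP and
    the next global waypoint, offset the local waypoint sideways; otherwise
    head straight for the goal. (Illustrative geometric check only.)"""
    vip_pos, goal = np.asarray(vip_pos, float), np.asarray(goal, float)
    heading = goal - vip_pos
    heading /= np.linalg.norm(heading)
    normal = np.array([-heading[1], heading[0]])      # left-hand side of the path
    for obs in map(np.asarray, obstacles):
        dist, t = point_segment_distance(obs, vip_pos, goal)
        if dist < safety_radius:
            side = np.sign(np.dot(obs - vip_pos, normal)) or 1.0
            detour_anchor = vip_pos + t * (goal - vip_pos)
            return detour_anchor - side * sidestep * normal   # step to the clear side
    return goal

print(local_plan(vip_pos=(0, 0), goal=(10, 0),
                 obstacles=[(5, 0.5), (20, 20)]))   # waypoint shifted below the path
```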
- Europe > Netherlands > South Holland > Delft (0.04)
- Asia > India > Karnataka > Bengaluru (0.04)
- Transportation (1.00)
- Health & Medicine (0.85)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams
Wu, Zike, Yan, Qi, Yi, Xuanyu, Wang, Lele, Liao, Renjie
Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams is crucial for numerous real-world applications. However, existing methods struggle to jointly address three key challenges: 1) processing uncalibrated inputs in real time, 2) accurately modeling dynamic scene evolution, and 3) maintaining long-term stability and computational efficiency. To this end, we introduce StreamSplat, the first fully feed-forward framework that transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner, capable of recovering scene dynamics from temporally local observations. We propose two key technical innovations: a probabilistic sampling mechanism in the static encoder for 3DGS position prediction, and a bidirectional deformation field in the dynamic decoder that enables robust and efficient dynamic modeling. Extensive experiments on static and dynamic benchmarks demonstrate that StreamSplat consistently outperforms prior works in both reconstruction quality and dynamic scene modeling, while uniquely supporting online reconstruction of arbitrarily long video streams. Code and models are available at https://github.com/nickwzk/StreamSplat.
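The abstract mentions a probabilistic sampling mechanism for predicting 3DGS positions in the static encoder. One plausible form, sketched below under stated assumptions, is per-pixel depth distributions (mean and log-variance) from which Gaussian centers are sampled and unprojected through the camera intrinsics; reparameterized sampling would keep this differentiable during training. This is not StreamSplat's actual mechanism, and all shapes and intrinsics are illustrative.

```python
import numpy as np

def sample_gaussian_positions(depth_mu, depth_logvar, intrinsics, rng=None):
    """Sample per-pixel depths from a predicted Normal(mu, sigma^2), then
    unproject through pinhole intrinsics to get candidate 3D Gaussian centers.
    (One plausible form of a probabilistic position-sampling step; the actual
    StreamSplat mechanism may differ.)"""
    rng = rng or np.random.default_rng()
    h, w = depth_mu.shape
    fx, fy, cx, cy = intrinsics
    # Reparameterized sample: depth = mu + sigma * eps (differentiable in a
    # training framework; plain NumPy here for illustration).
    eps = rng.standard_normal((h, w))
    depth = depth_mu + np.exp(0.5 * depth_logvar) * eps
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)   # (H*W, 3) centers

# Toy example: a 4x4 "image" whose encoder predicts roughly 2 m depth everywhere.
mu = np.full((4, 4), 2.0)
logvar = np.full((4, 4), np.log(0.01))
centers = sample_gaussian_positions(mu, logvar, intrinsics=(50.0, 50.0, 2.0, 2.0))
print(centers.shape, centers[:2])
```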
- North America > Canada > British Columbia (0.04)
- North America > Canada > Ontario (0.04)
- Europe > Romania > Black Sea (0.04)
- Asia (0.04)
Facial Foundational Model Advances Early Warning of Coronary Artery Disease from Live Videos with DigitalShadow
Zhou, Juexiao, Han, Zhongyi, Xin, Mankun, He, Xingwei, Wang, Guotao, Song, Jiaoyan, Luo, Gongning, He, Wenjia, Li, Xintong, Chu, Yuetan, Chen, Juanwen, Wang, Bo, Wu, Xia, Duan, Wenwen, Guo, Zhixia, Bai, Liyan, Pan, Yilin, Bi, Xuefei, Liu, Lu, Feng, Long, He, Xiaonan, Gao, Xin
Global population aging presents increasing challenges to healthcare systems, with coronary artery disease (CAD) responsible for approximately 17.8 million deaths annually, making it a leading cause of global mortality. As CAD is largely preventable, early detection and proactive management are essential. In this work, we introduce DigitalShadow, an advanced early warning system for CAD, powered by a fine-tuned facial foundation model. The system is pre-trained on 21 million facial images and subsequently fine-tuned into LiveCAD, a specialized CAD risk assessment model trained on 7,004 facial images from 1,751 subjects across four hospitals in China. DigitalShadow functions passively and contactlessly, extracting facial features from live video streams without requiring active user engagement. Integrated with a personalized database, it generates natural language risk reports and individualized health recommendations. With privacy as a core design principle, DigitalShadow supports local deployment to ensure secure handling of user data. The world's population is rapidly ageing [1], with significant implications for the prevalence of chronic diseases such as Coronary Artery Disease (CAD) [2], affecting not only individuals but also families and societies at large [3]. The number of older people is increasing at an unprecedented rate, projected to grow from approximately 761 million in 2021 to 1.6 billion by 2050, which would represent nearly 16% of the global population, according to the UN's World Social Report 2023 [4]. The aging population presents numerous challenges, including increased pressure on healthcare systems, pension schemes, and long-term care facilities, alongside potential economic consequences, which together fuel growing demand for healthcare services [5], [6]. With advancing age, people become more vulnerable to various critical diseases [7], such as CAD [8], stroke [9], cancer [10], and Parkinson's disease (PD) [11], [12], [13], leading to considerable morbidity and mortality [14].
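The passive, contactless pipeline described above (sample frames from a live stream, extract facial features, aggregate, and score risk locally) can be sketched as a simple loop. The embedding function and risk head below are toy placeholders standing in for the facial foundation model and the fine-tuned LiveCAD head; weights, sampling rates, and thresholds are illustrative only.

```python
import numpy as np

def embed_face(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the facial foundation model's embedding; stands in for
    the pre-trained encoder that LiveCAD fine-tunes."""
    return frame.mean(axis=(0, 1)) / 255.0          # toy 3-D "embedding"

def risk_head(embedding: np.ndarray) -> float:
    """Placeholder fine-tuned risk head (logistic over a linear score)."""
    w, b = np.array([1.2, -0.4, 0.7]), -0.3          # illustrative weights only
    return float(1.0 / (1.0 + np.exp(-(embedding @ w + b))))

def passive_screening(stream, sample_every=30, report_every=300):
    """Contactless loop: periodically embed frames, keep a running average of
    embeddings, and emit a risk estimate without any active user interaction."""
    embeddings = []
    for t, frame in enumerate(stream):
        if t % sample_every == 0:
            embeddings.append(embed_face(frame))
        if t and t % report_every == 0 and embeddings:
            yield t, risk_head(np.mean(embeddings, axis=0))

# Toy stream of random frames; all processing stays local to the device.
rng = np.random.default_rng(1)
stream = (rng.integers(0, 256, (64, 64, 3)).astype(float) for _ in range(600))
for t, risk in passive_screening(stream):
    print(f"t={t}: estimated CAD risk score {risk:.2f}")
```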
- Asia > China > Beijing > Beijing (0.05)
- Asia > Middle East > Saudi Arabia (0.04)
- Asia > China > Heilongjiang Province > Daqing (0.04)
- (4 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)