Goto

Collaborating Authors

 video quality



GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers

Song, Zhiye, Dai, Steve, Keller, Ben, Khailany, Brucek

arXiv.org Artificial Intelligence

Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications. We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).


Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI

Wu, Jiangkai, Ren, Zhiyuan, Liu, Liming, Zhang, Xinggong

arXiv.org Artificial Intelligence

AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we call for AI-oriented RTC research, exploring the network requirement shift from "humans watching video" to "AI understanding video". We begin by recognizing the main differences between AI Video Chat and traditional RTC. Then, through prototype measurements, we identify that ultra-low bitrate is a key factor for low latency. To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat. DeViBench is open-sourced at: https://github.com/pku-netvideo/DeViBench.


Human-in-the-Loop Bandwidth Estimation for Quality of Experience Optimization in Real-Time Video Communication

Khairy, Sami, Mittag, Gabriel, Gopal, Vishak, Cutler, Ross

arXiv.org Artificial Intelligence

The quality of experience (QoE) delivered by video conferencing systems is significantly influenced by accurately estimating the time-varying available bandwidth between the sender and receiver. Bandwidth estimation for real-time communications remains an open challenge due to rapidly evolving network architectures, increasingly complex protocol stacks, and the difficulty of defining QoE metrics that reliably improve user experience. In this work, we propose a deployed, human-in-the-loop, data-driven framework for bandwidth estimation to address these challenges. Our approach begins with training objective QoE reward models derived from subjective user evaluations to measure audio and video quality in real-time video conferencing systems. Subsequently, we collect roughly $1$M network traces with objective QoE rewards from real-world Microsoft Teams calls to curate a bandwidth estimation training dataset. We then introduce a novel distributional offline reinforcement learning (RL) algorithm to train a neural-network-based bandwidth estimator aimed at improving QoE for users. Our real-world A/B test demonstrates that the proposed approach reduces the subjective poor call ratio by $11.41\%$ compared to the baseline bandwidth estimator. Furthermore, the proposed offline RL algorithm is benchmarked on D4RL tasks to demonstrate its generalization beyond bandwidth estimation.



DiTraj: training-free trajectory control for video diffusion transformer

Lei, Cheng, Zhang, Jiayu, Ma, Yue, Wang, Xinyu, Chen, Long, Tang, Liang, Yan, Yiqiang, Su, Fei, Zhao, Zhicheng

arXiv.org Artificial Intelligence

Diffusion Transformers (DiT)-based video generation models with 3D full attention exhibit strong generative capabilities. Trajectory control represents a user-friendly task in the field of controllable video generation. However, existing methods either require substantial training resources or are specifically designed for U-Net, do not take advantage of the superior performance of DiT. To address these issues, we propose DiTraj, a simple but effective training-free framework for trajectory control in text-to-video generation, tailored for DiT. Specifically, first, to inject the object's trajectory, we propose foreground-background separation guidance: we use the Large Language Model (LLM) to convert user-provided prompts into foreground and background prompts, which respectively guide the generation of foreground and background regions in the video. Then, we analyze 3D full attention and explore the tight correlation between inter-token attention scores and position embedding. Based on this, we propose inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). By modifying only foreground tokens' position embedding, STD-RoPE eliminates their cross-frame spatial discrepancies, strengthening cross-frame attention among them and thus enhancing trajectory control. Additionally, we achieve 3D-aware trajectory control by regulating the density of position embedding. Extensive experiments demonstrate that our method outperforms previous methods in both video quality and trajectory controllability.


Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation

Adnan, Muhammad, Kurella, Nithesh, Arunkumar, Akhil, Nair, Prashant J.

arXiv.org Artificial Intelligence

Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing. However, their large model size and the quadratic cost of spatial-temporal attention over multiple denoising steps make video generation computationally expensive. Static caching mitigates this by reusing features across fixed steps but fails to adapt to generation dynamics, leading to suboptimal trade-offs between speed and quality. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance. Foresight dynamically identifies and reuses DiT block outputs for all layers across steps, adapting to generation parameters such as resolution and denoising schedules to optimize efficiency. Applied to OpenSora, Latte, and CogVideoX, Foresight achieves up to \latencyimprv end-to-end speedup, while maintaining video quality. The source code of Foresight is available at \href{https://github.com/STAR-Laboratory/foresight}{https://github.com/STAR-Laboratory/foresight}.


ASL360: AI-Enabled Adaptive Streaming of Layered 360° Video over UAV-assisted Wireless Networks

Mohammadhosseini, Alireza, Chakareski, Jacob, Mastronarde, Nicholas

arXiv.org Artificial Intelligence

We propose ASL360, an adaptive deep reinforcement learning-based scheduler for on-demand 360° video streaming to mobile VR users in next generation wireless networks. We aim to maximize the overall Quality of Experience (QoE) of the users served over a UAV-assisted 5G wireless network. Our system model comprises a macro base station (MBS) and a UAV-mounted base station which both deploy mm-Wave transmission to the users. The 360° video is encoded into dependent layers and segmented tiles, allowing a user to schedule downloads of each layer's segments. Furthermore, each user utilizes multiple buffers to store the corresponding video layer's segments. We model the scheduling decision as a Constrained Markov Decision Process (CMDP), where the agent selects Base or Enhancement layers to maximize the QoE and use a policy gradient-based method (PPO) to find the optimal policy. Additionally, we implement a dynamic adjustment mechanism for cost components, allowing the system to adaptively balance and prioritize the video quality, buffer occupancy, and quality change based on real-time network and streaming session conditions. We demonstrate that ASL360 significantly improves the QoE, achieving approximately 2 dB higher average video quality, 80% lower average rebuffering time, and 57% lower video quality variation, relative to competitive baseline methods. Our results show the effectiveness of our layered and adaptive approach in enhancing the QoE in immersive videostreaming applications, particularly in dynamic and challenging network environments.


O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing

Chen, Yuqing, Wang, Junjie, Liu, Lin, Chu, Ruihang, Zhang, Xiaopeng, Tian, Qi, Yang, Yujiu

arXiv.org Artificial Intelligence

Diffusion models have recently advanced video editing, yet controllable editing remains challenging due to the need for precise manipulation of diverse object properties. Current methods require different control signal for diverse editing tasks, which complicates model design and demands significant training resources. To address this, we propose O-DisCo-Edit, a unified framework that incorporates a novel object distortion control (O-DisCo). This signal, based on random and adaptive noise, flexibly encapsulates a wide range of editing cues within a single representation. Paired with a "copy-form" preservation module for preserving non-edited regions, O-DisCo-Edit enables efficient, high-fidelity editing through an effective training paradigm. Extensive experiments and comprehensive human evaluations consistently demonstrate that O-DisCo-Edit surpasses both specialized and multitask state-of-the-art methods across various video editing tasks. https://cyqii.github.io/O-DisCo-Edit.github.io/


Google's AI video creator gets major upgrade. How to use it.

Popular Science

Breakthroughs, discoveries, and DIY tips sent every weekday. With every passing month, AI-generated content gets harder to distinguish from material made by human beings. Google's latest video maker is a case in point: The newly launched Veo 3 model is a step up in terms of realism, while also adding audio for the first time, so synced dialog, natural sound, and other audio effects can be added in. Google promises the new Veo 3 model has a better understanding of real world physics, and is smarter at turning your text prompts into video clips. Those clips are capped at eight seconds for now, and at a resolution of 720p--presumably because of the high computing (and environmental) demands of generating these videos.