video virtual try-on
SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models
Nguyen, Hung, Nguyen, Quang Qui-Vinh, Nguyen, Khoi, Nguyen, Rang
Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. Although significant advances have been made in image-based virtual try-on, extending these successes to video often leads to frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to the repeated processing of the same frames, especially for long video sequences. To tackle these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we propose ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce the TikTokDress dataset, a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets. Extensive experiments demonstrate that our approach outperforms current baselines, particularly in terms of video consistency and inference speed. The project page is available at https://swift-try.github.io/.
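The computational cost of overlapping chunks mentioned in the abstract can be illustrated with simple arithmetic. The sketch below is not SwiftTry's ShiftCaching; it only counts how many frame passes a plain sliding-window scheme performs as the overlap grows. The function names and the example sizes (64 frames, 16-frame chunks) are assumptions chosen for illustration and do not come from the paper.

```python
def chunk_windows(num_frames, chunk_len, overlap):
    """Return (start, end) frame ranges for a sliding window with the given overlap.

    Hypothetical helper for illustration; not part of any SwiftTry code.
    """
    if chunk_len <= 0 or not 0 <= overlap < chunk_len:
        raise ValueError("need chunk_len > 0 and 0 <= overlap < chunk_len")
    stride = chunk_len - overlap
    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, num_frames)
        windows.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return windows


def frames_processed(num_frames, chunk_len, overlap):
    """Total number of frame passes, counting frames shared by several chunks each time."""
    return sum(end - start for start, end in chunk_windows(num_frames, chunk_len, overlap))


if __name__ == "__main__":
    # Assumed example sizes: a 64-frame video split into 16-frame chunks.
    for overlap in (0, 8, 12):
        total = frames_processed(64, 16, overlap)
        print(f"overlap={overlap:2d}: {total} frame passes ({total / 64:.2f}x the video length)")
```

With no overlap every frame is processed once; in this toy setting, an overlap of half the chunk length already pushes the work towards twice the video length, and the ratio keeps growing as the overlap approaches the chunk length.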
PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm
Chang, Tianyu, Chen, Xiaohao, Wei, Zhichao, Zhang, Xuanpu, Chen, Qing-Guo, Luo, Weihua, Yang, Xun
Video Virtual Try-on aims to fluently transfer the garment image to a semantically aligned try-on area in the source person video. Previous methods leveraged the inpainting mask to remove the original garment in the source video, thus achieving accurate garment transfer on simple model videos. However, when these methods are applied to realistic video data with more complex scene changes and posture movements, the overly large and incoherent agnostic masks will destroy the essential spatial-temporal information of the original video, thereby inhibiting the fidelity and coherence of the try-on video. To alleviate this problem, we propose a novel point-enhanced mask-free video virtual try-on framework (PEMF-VVTO). Specifically, we first leverage the pre-trained mask-based try-on model to construct large-scale paired training data (pseudo-person samples). Training on these mask-free data enables our model to perceive the original spatial-temporal information while realizing accurate garment transfer. Then, based on the pre-acquired sparse frame-cloth and frame-frame point alignments, we design the point-enhanced spatial attention (PSA) and point-enhanced temporal attention (PTA) to further improve the try-on accuracy and video coherence of the mask-free model. Concretely, PSA explicitly guides the garment transfer to desirable locations through the sparse semantic alignments of video frames and cloth. PTA exploits the temporal attention on sparse point correspondences to enhance the smoothness of generated videos. Extensive qualitative and quantitative experiments clearly illustrate that our PEMF-VVTO can generate more natural and coherent try-on videos than existing state-of-the-art methods.
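The idea of applying temporal attention only along sparse point correspondences can be sketched in a few lines. This is a minimal illustration, not the PTA module from the paper: it assumes the features have already been sampled at P tracked points in each of T frames, and it uses plain scaled dot-product self-attention with identity query, key and value projections, whereas a real model would learn them. The function name and the tensor layout are assumptions.

```python
import numpy as np


def temporal_attention_over_tracks(track_feats):
    """Self-attention along the time axis, applied independently to each point track.

    track_feats: array of shape (T, P, C), the feature of point p in frame t.
    Returns an array of the same shape in which each feature is a softmax-weighted
    average of that track's features across all frames.
    Hypothetical sketch: no learned projections, no multi-head split.
    """
    T, P, C = track_feats.shape
    x = np.transpose(track_feats, (1, 0, 2))                # (P, T, C)
    scores = x @ np.transpose(x, (0, 2, 1)) / np.sqrt(C)    # (P, T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over frames
    out = weights @ x                                        # (P, T, C)
    return np.transpose(out, (1, 0, 2))                      # back to (T, P, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(5, 3, 8))  # 5 frames, 3 tracked points, 8 channels
    smoothed = temporal_attention_over_tracks(feats)
    print(smoothed.shape)
```

Because each track attends only to its own positions in other frames, the cost scales with the number of tracked points rather than with the full spatial resolution, which is the appeal of working on sparse correspondences.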
BIGO and iQIYI's ClothFormer: Realistic Video Virtual Try-on Come True
Total global retail e-commerce sales have more than tripled over the last six years and are projected to top US$7 trillion by 2025. With fashion claiming an increasing share of this market, suppliers are increasingly deploying AI-powered virtual try-on systems. Such systems are not only changing buyers' shopping habits and boosting the e-commerce industry; they also have applications in short video and other popular domains. While the quality of image-based virtual try-on methods has dramatically improved, video-based virtual try-on remains relatively underdeveloped, as it is difficult and computationally costly to generate visually pleasing and temporally coherent video results. In the new paper "ClothFormer: Taming Video Virtual Try-on in All Module", a research team from BIGO Technology and iQIYI Inc. presents ClothFormer, a novel video virtual try-on framework that preserves the features and details of both clothing and people to generate realistic and temporally smooth try-on videos that surpass the outputs of current state-of-the-art virtual try-on systems by a large margin.