motion module
Review for NeurIPS paper: Multi-agent Trajectory Prediction with Fuzzy Query Attention
Weaknesses: The experiments are extensive; however, I have the following three crucial questions to better understand the performance boost arising from the overall architecture: 1. Does the improvement arise from the interaction module or the motion module? Taking Social LSTM [1] as an interaction-based baseline, the proposed architecture has two different components: the interaction and motion modules. Is the boost coming from the interaction module, i.e., FQA compared to Social Pooling [1], or from the new motion module? An ablation study showing the performance while keeping the motion module the same as the baseline's would help answer this question. The authors use the term Fuzzy to describe continuous-valued decisions, as opposed to their discrete-valued boolean counterparts.
3D Multi-Object Tracking with Semi-Supervised GRU-Kalman Filter
Wang, Xiaoxiang, Liu, Jiaxin, Feng, Miaojie, Zhang, Zhaoxing, Yang, Xin
3D Multi-Object Tracking (MOT), a fundamental component of environmental perception, is essential for intelligent systems like autonomous driving and robotic sensing. Although Tracking-by-Detection (TBD) frameworks have demonstrated excellent performance in recent years, their application in real-world scenarios faces significant challenges. Object movement in complex environments is often highly nonlinear, while existing methods typically rely on linear approximations of motion. Furthermore, system noise is frequently modeled as a Gaussian distribution, which fails to capture the true complexity of the noise dynamics. These oversimplified modeling assumptions can lead to significant reductions in tracking precision. To address this, we propose a GRU-based MOT method that introduces a learnable Kalman filter into the motion module. This approach learns object motion characteristics in a data-driven manner, thereby avoiding manual model design and the associated modeling error. At the same time, to avoid abnormal supervision caused by wrong associations between annotations and trajectories, we design a semi-supervised learning strategy to accelerate convergence and improve the robustness of the model. Evaluation experiments on the nuScenes and Argoverse 2 datasets demonstrate that our system exhibits superior performance and significant potential compared to traditional TBD methods.
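For intuition, the sketch below shows one way a learnable, GRU-parameterized Kalman-style motion update could look in PyTorch. The module structure, state layout, and sigmoid gain are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GRUKalmanMotion(nn.Module):
    """Kalman-style predict/update where the gain comes from a GRU instead of
    hand-designed process/measurement noise models."""
    def __init__(self, state_dim=7, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRUCell(2 * state_dim, hidden_dim)        # input: predicted state + residual
        self.transition = nn.Linear(hidden_dim, state_dim)      # learned (possibly nonlinear) motion model
        self.gain = nn.Sequential(nn.Linear(hidden_dim, state_dim), nn.Sigmoid())

    def forward(self, state, detection, hidden):
        pred = state + self.transition(hidden)                   # predict step
        residual = detection - pred                              # innovation w.r.t. the new detection
        hidden = self.gru(torch.cat([pred, residual], dim=-1), hidden)
        new_state = pred + self.gain(hidden) * residual          # update step with learned gain
        return new_state, hidden

# Usage: one tracked object with a 7-dim box state (x, y, z, yaw, l, w, h).
tracker = GRUKalmanMotion()
state, hidden = torch.zeros(1, 7), torch.zeros(1, 64)
for detection in torch.randn(5, 1, 7):                           # five frames of detections
    state, hidden = tracker(state, detection, hidden)
```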
EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture
Xu, Jiaqi, Zou, Xinyi, Huang, Kunzhe, Chen, Yunkuo, Liu, Bo, Cheng, MengLi, Shi, Xing, Huang, Jun
This paper presents EasyAnimate, an advanced method for video generation that leverages the power of the transformer architecture for high-performance outcomes. We have expanded the DiT framework, originally designed for 2D image synthesis, to accommodate the complexities of 3D video generation by incorporating a motion module block that captures temporal dynamics, thereby ensuring consistent frames and seamless motion transitions. The motion module can be adapted to various DiT baseline methods to generate videos with different styles. It also supports different frame rates and resolutions during both training and inference, and is applicable to both images and videos. Moreover, we introduce slice VAE, a novel approach to condense the temporal axis, facilitating the generation of long-duration videos. Currently, EasyAnimate can generate videos with 144 frames. We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT model training (both the baseline and LoRA models), and end-to-end video inference. Code is available at: https://github.com/aigc-apps/EasyAnimate. We are continuously working to enhance the performance of our method.
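As a rough illustration of the slicing idea for long videos, the sketch below splits the frame axis into chunks, encodes each chunk separately, and concatenates the compressed latents back along time. The toy average-pooling encoder and the chunk length are placeholders, not EasyAnimate's actual slice VAE.

```python
import torch
import torch.nn.functional as F

def toy_temporal_encoder(chunk):
    # chunk: (batch, channels, frames, height, width); compress frames by 2x
    return F.avg_pool3d(chunk, kernel_size=(2, 1, 1))

def slice_encode(video, slice_len=8):
    # video: (batch, channels, frames, height, width)
    chunks = video.split(slice_len, dim=2)                 # slice along the temporal axis
    latents = [toy_temporal_encoder(c) for c in chunks]    # each slice fits in memory on its own
    return torch.cat(latents, dim=2)                       # reassemble the compressed timeline

video = torch.randn(1, 3, 144, 32, 32)                     # e.g., a 144-frame clip
print(slice_encode(video).shape)                           # torch.Size([1, 3, 72, 32, 32])
```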
AnimateDiff-Lightning: Cross-Model Diffusion Distillation
We present AnimateDiff-Lightning for lightning-fast video generation. Our model uses progressive adversarial diffusion distillation to achieve new state-of-the-art in few-step video generation. We discuss our modifications to adapt it for the video modality. Furthermore, we propose to simultaneously distill the probability flow of multiple base diffusion models, resulting in a single distilled motion module.

Video generative models are gaining great attention lately. Text-to-video models [2-4, 6, 8, 30, 36, 44] allow the creation of videos straight from ideation; image-to-video models [2, 4, 6, 36] enable more fine-grained control over content and composition; video-to-video models [4, 6] can convert existing videos to different styles, such as anime or cartoon. The advancement in video generation has enabled brand-new creative possibilities.
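The toy sketch below only illustrates the cross-model aspect: a single trainable motion module is shared across several frozen base models, and the per-model distillation losses are summed. The tiny stand-in networks and the plain MSE surrogate are assumptions for illustration; the actual method applies progressive adversarial diffusion distillation to video models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBase(nn.Module):
    """Stand-in for a frozen, stylized base model (the real ones are large UNets)."""
    def __init__(self, dim=8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, latents, motion_module=None):
        h = self.proj(latents)
        return motion_module(h) if motion_module is not None else h

shared_motion = nn.Linear(8, 8)                       # the single motion module being distilled
students = [ToyBase() for _ in range(3)]              # e.g., realistic, anime, cartoon bases
teachers = [ToyBase() for _ in range(3)]              # many-step teachers providing targets
for model in students + teachers:
    model.requires_grad_(False)                       # only the shared motion module is trained

optimizer = torch.optim.AdamW(shared_motion.parameters(), lr=1e-4)
latents = torch.randn(4, 8)
loss = sum(F.mse_loss(student(latents, shared_motion), teacher(latents).detach())
           for student, teacher in zip(students, teachers))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```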
Magic-Me: Identity-Specific Video Customized Diffusion
Ma, Ze, Zhou, Daquan, Yeh, Chun-Hsiao, Wang, Xue-She, Li, Xiuyu, Yang, Huanrui, Dong, Zhen, Keutzer, Kurt, Feng, Jiashi
Creating content for a specific identity (ID) has attracted significant interest in the field of generative models. In the field of text-to-image generation (T2I), subject-driven content generation has made great progress, with the ID in the generated images being controllable. However, extending it to video generation is not well explored. In this work, we propose a simple yet effective subject-identity-controllable video generation framework, termed Video Custom Diffusion (VCD). With a specified subject ID defined by a few images, VCD reinforces the identity information extraction and injects frame-wise correlation at the initialization stage for stable video outputs with identity preserved to a large extent. To achieve this, we propose three novel components that are essential for high-quality ID preservation: 1) an ID module trained with the cropped identity by prompt-to-segmentation to disentangle the ID information and the background noise for more accurate ID token learning; 2) a text-to-video (T2V) VCD module with a 3D Gaussian Noise Prior for better inter-frame consistency; and 3) video-to-video (V2V) Face VCD and Tiled VCD modules to deblur the face and upscale the video for higher resolution. Despite its simplicity, we conducted extensive experiments to verify that VCD is able to generate stable and high-quality videos with better ID preservation than the selected strong baselines. Moreover, thanks to the transferability of the ID module, VCD also works well with publicly available fine-tuned text-to-image models, further improving its usability. The code is available at https://github.com/Zhen-Dong/Magic-Me.
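A minimal sketch of what a correlated noise initialization of this kind could look like is given below: each frame's starting noise mixes a component shared across frames with an independent one, so frames begin from similar noise. The mixing ratio and latent shape are illustrative assumptions, not the paper's 3D Gaussian Noise Prior specifics.

```python
import torch

def gaussian_noise_prior(frames=16, channels=4, height=64, width=64, shared_ratio=0.5):
    shared = torch.randn(1, channels, height, width).expand(frames, -1, -1, -1)
    independent = torch.randn(frames, channels, height, width)
    # Mix so each frame's noise stays approximately unit variance.
    return (shared_ratio ** 0.5) * shared + ((1 - shared_ratio) ** 0.5) * independent

init_noise = gaussian_noise_prior()
print(init_noise.shape)  # torch.Size([16, 4, 64, 64])
```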
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Guo, Yuwei, Yang, Ceyuan, Rao, Anyi, Wang, Yaohui, Qiao, Yu, Lin, Dahua, Dai, Bo
With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs. Code and pre-trained weights will be publicly available at https://animatediff.github.io/ .
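The sketch below illustrates the general pattern of such an insertable motion module: a temporal self-attention block whose output projection is zero-initialized, added residually so the frozen image model's behavior is unchanged before training. Layer sizes and the exact placement are assumptions, not AnimateDiff's released implementation.

```python
import torch
import torch.nn as nn

class TemporalMotionModule(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj_out.weight)       # zero-init: the module starts as an identity
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, x):
        # x: (batch * spatial_tokens, frames, dim) -- attention runs along the frame axis
        h = self.norm(x)
        h, _ = self.attn(h, h, h)
        return x + self.proj_out(h)                # residual add onto the frozen T2I features

features = torch.randn(64, 16, 320)                # 64 spatial positions, 16 frames
print(TemporalMotionModule()(features).shape)      # torch.Size([64, 16, 320])
```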
Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering
Seo, Ahjeong, Kang, Gi-Cheon, Park, Joonhan, Zhang, Byoung-Tak
Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded in motion and appearance information and selectively utilize them depending on the question's intention. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes the action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes each output of the motion module and the appearance module as input, and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN. The code is available at https://github.com/ahjeongseo/MASN-pytorch.
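As a rough sketch of question-guided fusion, the snippet below scores the motion and appearance streams against the question embedding and blends them with softmax weights; the bilinear scoring and single-gate design are illustrative assumptions rather than MASN's actual fusion module.

```python
import torch
import torch.nn as nn

class QuestionGuidedFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score_motion = nn.Bilinear(dim, dim, 1)
        self.score_appearance = nn.Bilinear(dim, dim, 1)

    def forward(self, question, motion_feat, appearance_feat):
        # question, motion_feat, appearance_feat: (batch, dim)
        scores = torch.cat([self.score_motion(question, motion_feat),
                            self.score_appearance(question, appearance_feat)], dim=-1)
        weights = scores.softmax(dim=-1)                    # how much each stream matters
        return weights[:, :1] * motion_feat + weights[:, 1:] * appearance_feat

fusion = QuestionGuidedFusion()
q, m, a = torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512)
print(fusion(q, m, a).shape)                                # torch.Size([2, 512])
```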