
Collaborating Authors

 Wang, Angtian


CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance

arXiv.org Artificial Intelligence

Video generation has witnessed remarkable progress with the advent of deep generative models, particularly diffusion models. While existing methods excel in generating high-quality videos from text prompts or single images, personalized multi-subject video generation remains a largely unexplored challenge. This task involves synthesizing videos that incorporate multiple distinct subjects, each defined by separate reference images, while ensuring temporal and spatial consistency. Current approaches primarily rely on mapping subject images to keywords in text prompts, which introduces ambiguity and limits their ability to model subject relationships effectively. In this paper, we propose CINEMA, a novel framework for coherent multi-subject video generation that leverages a Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. By using the MLLM to interpret subject relationships, our method facilitates scalability, enabling the use of large and diverse datasets for training. Furthermore, our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation. Through extensive evaluations, we demonstrate that our approach significantly improves subject consistency and overall video coherence, paving the way for advanced applications in storytelling, interactive media, and personalized video generation.
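
The abstract does not specify how a variable number of subjects is injected into the diffusion backbone, so the following is only a minimal sketch of one plausible mechanism: projecting MLLM-derived subject tokens to the latent width and letting the video latents cross-attend to a padded, masked set of subject tokens. All module names, dimensions, and shapes here are illustrative assumptions, not the CINEMA implementation.

import torch
import torch.nn as nn

class MultiSubjectCrossAttention(nn.Module):
    """Cross-attends video latents to a variable-length set of subject tokens."""

    def __init__(self, latent_dim: int = 320, mllm_dim: int = 1024, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, latent_dim)  # map MLLM tokens to latent width
        self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)

    def forward(self, video_tokens, subject_tokens, subject_mask=None):
        # video_tokens:   (B, N_latent, latent_dim) flattened spatio-temporal latents
        # subject_tokens: (B, N_subj * T_subj, mllm_dim) MLLM embeddings of the reference
        #                 images, padded so batches with different subject counts stack
        # subject_mask:   (B, N_subj * T_subj) True where a token is padding
        keys = self.proj(subject_tokens)
        out, _ = self.attn(video_tokens, keys, keys, key_padding_mask=subject_mask)
        return video_tokens + out  # residual injection into the diffusion backbone

# Toy usage: two samples, one with 2 subjects and one with 3 (padded to 3).
layer = MultiSubjectCrossAttention()
video = torch.randn(2, 16 * 64, 320)              # latent tokens
subjects = torch.randn(2, 3 * 77, 1024)           # padded MLLM subject tokens
mask = torch.zeros(2, 3 * 77, dtype=torch.bool)
mask[0, 2 * 77:] = True                           # first sample has only 2 subjects
print(layer(video, subjects, mask).shape)         # torch.Size([2, 1024, 320])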


iNeMo: Incremental Neural Mesh Models for Robust Class-Incremental Learning

arXiv.org Artificial Intelligence

Unlike human learning, it is still common practice today to train deep learning models for vision tasks only once, on fixed datasets. A variety of approaches have recently addressed handling continual data streams. However, extending these methods to manage out-of-distribution (OOD) scenarios has not been effectively investigated. On the other hand, it has recently been shown that non-continual neural mesh models exhibit strong performance in generalizing to such OOD scenarios. To leverage this decisive property in a continual learning setting, we propose incremental neural mesh models that can be extended with new meshes over time. In addition, we present a latent space initialization strategy that enables us to allocate feature space for future unseen classes in advance, and a positional regularization term that forces the features of the different classes to consistently stay in their respective latent space regions. We demonstrate the effectiveness of our method through extensive experiments on the Pascal3D and ObjectNet3D datasets and show that our approach outperforms the baselines for classification by 2-6% in the in-domain setting and by 6-50% in the OOD setting. Our work also presents the first incremental learning approach for pose estimation. Our code and model can be found at github.com/Fischer-Tom/iNeMo.
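
The abstract names two ingredients, a latent space initialization that reserves regions for future classes and a positional regularization term; the snippet below is a hedged sketch of how such a regularizer could look, with randomly drawn unit-norm anchors standing in for the paper's initialization strategy. The names and the cosine-based penalty are assumptions for illustration, not the released iNeMo code.

import torch
import torch.nn.functional as F

def make_anchors(num_total_classes: int, dim: int, seed: int = 0) -> torch.Tensor:
    """Reserve one unit-norm anchor per class, including classes not yet seen."""
    g = torch.Generator().manual_seed(seed)
    return F.normalize(torch.randn(num_total_classes, dim, generator=g), dim=-1)

def positional_regularizer(features, labels, anchors):
    """Penalize features that drift away from their class's reserved latent region."""
    target = anchors[labels]  # (B, dim): the anchor assigned to each sample's class
    return (1.0 - F.cosine_similarity(features, target, dim=-1)).mean()

# Usage: 10 classes reserved up front, only classes 0-4 observed so far.
anchors = make_anchors(num_total_classes=10, dim=128)
feats = F.normalize(torch.randn(32, 128), dim=-1)
labels = torch.randint(0, 5, (32,))
print(float(positional_regularizer(feats, labels, anchors)))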


Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering

arXiv.org Artificial Intelligence

For vision-language models (VLMs), understanding the dynamic properties of objects and their interactions within 3D scenes from video is crucial for effective reasoning. In this work, we introduce SuperCLEVR-Physics, a video question answering dataset that focuses on the dynamic properties of objects. We concentrate on physical concepts -- velocity, acceleration, and collisions within 4D scenes -- where the model needs to fully understand these dynamic properties and answer questions built on top of them. From the evaluation of a variety of current VLMs, we find that these models struggle with understanding these dynamic properties due to the lack of explicit knowledge of the 3D spatial structure of scenes and of world dynamics over time. To demonstrate the importance of an explicit 4D dynamics representation of the scenes in understanding world dynamics, we further propose NS-4Dynamics, a Neural-Symbolic model for reasoning about 4D dynamic properties under an explicit scene representation derived from videos. Using a scene rendering likelihood combined with a physical prior distribution, the 4D scene parser estimates the dynamic properties of objects over time and interprets the observations into a 4D scene representation serving as world states. By further incorporating neural-symbolic reasoning, our approach enables advanced applications in future prediction, factual reasoning, and counterfactual reasoning. Our experiments show that NS-4Dynamics surpasses previous VLMs in understanding dynamic properties and answering questions about factual queries, future prediction, and counterfactual reasoning. Moreover, based on the explicit 4D scene representation, our model is effective in reconstructing the 4D scenes and re-simulating future or counterfactual events.
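
As a concrete illustration of the physical quantities the questions target, the sketch below shows how velocity, acceleration, and collisions could be read off per-object 3D trajectories with simple finite differences once a 4D scene representation is available. The function names, spherical-collision test, and radii are assumptions for exposition, not the NS-4Dynamics parser.

import numpy as np

def velocities(traj: np.ndarray, dt: float) -> np.ndarray:
    """traj: (T, 3) object centers over time -> (T-1, 3) velocities."""
    return np.diff(traj, axis=0) / dt

def accelerations(traj: np.ndarray, dt: float) -> np.ndarray:
    """Second finite difference -> (T-2, 3) accelerations."""
    return np.diff(traj, n=2, axis=0) / dt ** 2

def collision_frames(traj_a, traj_b, radius_a, radius_b):
    """Frames where two (approximately spherical) objects overlap."""
    dist = np.linalg.norm(traj_a - traj_b, axis=-1)
    return np.nonzero(dist < radius_a + radius_b)[0]

# Toy example: object A moves at constant velocity toward a static object B.
t = np.linspace(0.0, 1.0, 11)
a = np.stack([2.0 * t, np.zeros_like(t), np.zeros_like(t)], axis=-1)
b = np.tile(np.array([1.5, 0.0, 0.0]), (len(t), 1))
print(velocities(a, dt=0.1)[0])          # ~[2, 0, 0]
print(accelerations(a, dt=0.1)[0])       # ~[0, 0, 0]
print(collision_frames(a, b, 0.2, 0.2))  # frames where the two objects overlap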


3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation

arXiv.org Artificial Intelligence

Regression-based methods for 3D human pose estimation directly predict the 3D pose parameters from a 2D image using deep networks. While achieving state-of-the-art performance on standard benchmarks, their performance degrades under occlusion. In contrast, optimization-based methods fit a parametric body model to 2D features in an iterative manner. The localized reconstruction loss can potentially make them robust to occlusion, but they suffer from the 2D-3D ambiguity. Motivated by the recent success of generative models in rigid object pose estimation, we propose 3D-aware Neural Body Fitting (3DNBF) - an approximate analysis-by-synthesis approach to 3D human pose estimation with SOTA performance and occlusion robustness. In particular, we propose a generative model of deep features based on a volumetric human representation with Gaussian ellipsoidal kernels emitting 3D pose-dependent feature vectors. The neural features are trained with contrastive learning to become 3D-aware and hence to overcome the 2D-3D ambiguity. Experiments show that 3DNBF outperforms other approaches on both occluded and standard benchmarks. Code is available at https://github.com/edz-o/3DNBF
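
To make the analysis-by-synthesis idea concrete, here is a rough, hypothetical sketch of scoring a candidate pose: project the centers of the Gaussian ellipsoidal kernels into the image and compare their pose-dependent feature vectors with the CNN feature map at the projected locations. The projection model, dimensions, and function names are illustrative assumptions rather than the 3DNBF code, which additionally models occlusion and optimizes the pose.

import torch
import torch.nn.functional as F

def project(points_3d: torch.Tensor, focal: float = 200.0) -> torch.Tensor:
    """Pinhole projection of (K, 3) camera-space points to normalized [-1, 1] coords."""
    xy = points_3d[:, :2] / points_3d[:, 2:3].clamp(min=1e-3)
    return xy * focal / 128.0  # assume a 256x256 image, i.e. a half-size of 128 pixels

def pose_score(feature_map, kernel_points, kernel_feats):
    """feature_map: (C, H, W); kernel_points: (K, 3); kernel_feats: (K, C)."""
    grid = project(kernel_points).view(1, 1, -1, 2)         # (1, 1, K, 2) sampling grid
    sampled = F.grid_sample(feature_map[None], grid, align_corners=False)
    sampled = sampled[0, :, 0].t()                          # (K, C) features at projections
    return F.cosine_similarity(sampled, kernel_feats, dim=-1).mean()

# Toy usage: rank two candidate poses against a (random) feature map.
feat = torch.randn(64, 32, 32)
kernel_feats = F.normalize(torch.randn(20, 64), dim=-1)
pose_a = torch.randn(20, 3) + torch.tensor([0.0, 0.0, 3.0])
pose_b = torch.randn(20, 3) + torch.tensor([0.5, 0.0, 3.0])
print(pose_score(feat, pose_a, kernel_feats), pose_score(feat, pose_b, kernel_feats))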


ROBIN: A Benchmark for Robustness to Individual Nuisances in Real-World Out-of-Distribution Shifts

arXiv.org Artificial Intelligence

Enhancing robustness in real-world scenarios has proven very challenging. One reason is that existing robustness benchmarks are limited: they either rely on synthetic data or they simply measure robustness as generalization between datasets and hence ignore the effects of individual nuisance factors. In this work, we introduce ROBIN, a benchmark dataset for diagnosing the robustness of vision algorithms to individual nuisances in real-world images. ROBIN builds on 10 rigid categories from the PASCAL VOC 2012 and ImageNet datasets and includes out-of-distribution examples of the objects' 3D pose, shape, texture, context, and weather conditions. ROBIN is richly annotated to enable benchmarking of models for image classification, object detection, and 3D pose estimation. We provide results for a number of popular baselines and make several interesting observations: 1. Some nuisance factors have a much stronger negative effect on performance than others. Moreover, the negative effect of an OOD nuisance depends on the downstream vision task. 2. Current approaches to enhance OOD robustness using strong data augmentation have only marginal effects in real-world OOD scenarios, and sometimes even reduce OOD performance. 3. We do not observe any significant differences between convolutional and transformer architectures in terms of OOD robustness. We believe our dataset provides a rich testbed to study the OOD robustness of vision algorithms and will help to significantly push forward research in this area.
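
Since the value of the benchmark lies in per-factor diagnosis rather than a single aggregate number, the short sketch below shows the kind of breakdown the annotations enable: accuracy reported separately for each nuisance factor. The record format and function are assumptions for illustration, not the official ROBIN evaluation toolkit.

from collections import defaultdict

def per_nuisance_accuracy(records):
    """records: iterable of (nuisance_factor, predicted_label, true_label) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for nuisance, pred, true in records:
        totals[nuisance] += 1
        hits[nuisance] += int(pred == true)
    return {n: hits[n] / totals[n] for n in totals}

# Toy usage over the five annotated nuisance factors.
records = [
    ("pose", "car", "car"), ("pose", "bus", "car"),
    ("shape", "chair", "chair"), ("texture", "sofa", "chair"),
    ("context", "boat", "boat"), ("weather", "bicycle", "bicycle"),
]
print(per_nuisance_accuracy(records))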