Collaborating Authors

 Xu, Dan


Multi-Attribute Multi-Grained Adaptation of Pre-Trained Language Models for Text Understanding from Bayesian Perspective

arXiv.org Artificial Intelligence

Current neural networks often employ multi-domain-learning or attribute-injecting mechanisms to incorporate non-independent and identically distributed (non-IID) information for text understanding tasks by capturing individual characteristics and the relationships among samples. However, the extent of the impact of non-IID information and how these methods affect pre-trained language models (PLMs) remain unclear. This study revisits the assumption that non-IID information enhances PLMs to achieve performance improvements from a Bayesian perspective, which unearths and integrates non-IID and IID features. Furthermore, we propose a multi-attribute multi-grained framework for PLM adaptations (M2A), which combines multi-attribute and multi-grained views to mitigate uncertainty in a lightweight manner. We evaluate M2A on prevalent text-understanding datasets and demonstrate its superior performance, particularly when data are implicitly non-IID and PLMs scale up.


I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

arXiv.org Artificial Intelligence

This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.


Human-Centric Foundation Models: Perception, Generation and Agentic Modeling

arXiv.org Artificial Intelligence

In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models that capture fine-grained features for multi-modal 2D and 3D understanding; (2) Human-centric AIGC Foundation Models that generate high-fidelity, diverse human-related content; (3) Unified Perception and Generation Models that integrate these capabilities to enhance both human understanding and synthesis; and (4) Human-centric Agentic Foundation Models that extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied ...

... community appeals for a unified framework [Ci et al., 2023; Wang et al., 2023; Chen et al., 2024; Huang et al., 2024a] to unlock systematic understanding and a wide range of human-centric applications for everybody. Inspired by rapid advancements of general foundation models, e.g., large language models (LLMs), large vision models (LVMs) and text-to-image generative models, and the paradigm shift they present from end-to-end learning of task-specific models to generalist models, a recent trend is to develop Human-centric Foundation Models (HcFM) that satisfy three criteria, namely generalization, broad applicability, and high fidelity. Generalization ensures robustness to unseen conditions, enabling the model to perform consistently across varied environments.


MM-Ego: Towards Building Egocentric Multimodal LLMs

arXiv.org Artificial Intelligence

This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. This is currently the largest egocentric QA dataset. Second, we contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability to recognize and memorize visual details across videos of varying lengths. We introduce a new de-biasing evaluation method to help mitigate the unavoidable language bias present in the models being evaluated. Third, we propose a specialized multimodal architecture featuring a novel "Memory Pointer Prompting" mechanism. This design includes a global glimpse step to gain an overarching understanding of the entire video and identify key visual information, followed by a fallback step that utilizes the key visual information to generate responses. This enables the model to more effectively comprehend extended video content. With the data, benchmark, and model, we successfully build MM-Ego, an egocentric multimodal LLM that shows powerful performance on egocentric video understanding.


Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling

arXiv.org Artificial Intelligence

This paper introduces Motion-oriented Compositional Neural Radiance Fields (MoCo-NeRF), a framework designed to perform free-viewpoint rendering of monocular human videos via a novel non-rigid motion modeling approach. In the context of dynamic clothed humans, complex cloth dynamics generate non-rigid motions that are intrinsically distinct from skeletal articulations and critically important for the rendering quality. The conventional approach models non-rigid motions as spatial (3D) deviations in addition to skeletal transformations. However, it is either time-consuming or struggles to achieve optimal quality because of its high learning complexity without direct supervision. To address this problem, we propose a novel approach that models non-rigid motions as radiance residual fields, which benefits from more direct color supervision in the rendering and uses the rigid radiance fields as a prior to reduce the complexity of the learning process. Our approach utilizes a single multiresolution hash encoding (MHE) to concurrently learn the canonical T-pose representation from rigid skeletal motions and the radiance residual field for non-rigid motions. Additionally, to further improve both training efficiency and usability, we extend MoCo-NeRF to support simultaneous training of multiple subjects within a single framework, thanks to our effective design for modeling non-rigid motions. This scalability is achieved through the integration of a global MHE and learnable identity codes in addition to multiple local MHEs. We present extensive results on ZJU-MoCap and MonoCap, clearly demonstrating state-of-the-art performance in both single- and multi-subject settings. The code and model will be made publicly available at the project page: https://stevejaehyeok.github.io/publications/moco-nerf.


Learning Online Scale Transformation for Talking Head Video Generation

arXiv.org Artificial Intelligence

One-shot talking head video generation uses a source image and driving video to create a synthetic video where the source person's facial movements imitate those of the driving video. However, differences in scale between the source and driving images remain a challenge for face reenactment. Existing methods attempt to locate a frame in the driving video that aligns best with the source image, but imprecise alignment can result in suboptimal outcomes. To address this, we introduce a scale transformation module that automatically adjusts the scale of the driving image to fit that of the source image, using the scale difference information carried by the detected keypoints of the source image and the driving frame. Furthermore, to keep perceiving the scale information of faces during the generation process, we incorporate the scale information learned from the scale transformation module into each layer of the generation process to produce a final result with an accurate scale. With the proposed online scale transformation facial reenactment network, our method performs accurate motion transfer between the two images without any anchor frame. Extensive experiments demonstrate that our proposed method adjusts the scale of the driving face automatically according to the source face, and generates high-quality faces with an accurate scale in cross-identity facial reenactment.
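To make the idea of reading a scale difference from keypoints concrete, here is a minimal sketch. It is not the learned scale transformation module described in the abstract: it is a hand-written heuristic that assumes 2D keypoints and measures scale as the root-mean-square distance of keypoints from their centroid. The function names `keypoint_spread` and `rescale_driving` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def centroid(points: List[Point]) -> Point:
    """Mean position of a set of 2D keypoints."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def keypoint_spread(points: List[Point]) -> float:
    """Root-mean-square distance of the keypoints from their centroid."""
    cx, cy = centroid(points)
    return math.sqrt(
        sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / len(points)
    )


def rescale_driving(source: List[Point], driving: List[Point]):
    """Scale driving keypoints about their centroid to match the source spread.

    Assumption: scale is summarized by a single isotropic ratio.
    """
    ratio = keypoint_spread(source) / keypoint_spread(driving)
    cx, cy = centroid(driving)
    scaled = [(cx + ratio * (x - cx), cy + ratio * (y - cy)) for x, y in driving]
    return ratio, scaled


# Toy data: the driving face is twice as large as the source face.
source_kp = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
driving_kp = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
ratio, scaled_kp = rescale_driving(source_kp, driving_kp)
```

In this toy case the estimated ratio is about 0.5, and the rescaled driving keypoints have the same spread as the source keypoints.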


Sample-efficient Imitative Multi-token Decision Transformer for Generalizable Real World Driving

arXiv.org Artificial Intelligence

The realm of autonomous driving research has witnessed remarkable progress, with simulation technologies [1][2][3][4] reaching unprecedented levels of realism and the burgeoning availability of real-world driving datasets [5][6][7][8]. Despite these advancements, data-driven planning continues to confront a formidable obstacle: the infinite state space and extensive data distribution characteristic of real-world driving. Imitation learning approaches encounter hurdles [9][10] when presented with scenarios that deviate from the training distribution, exemplified by rare events like emergency braking for unforeseen obstacles. Similarly, these methods grapple with long-tail distribution phenomena, such as navigating through unexpected weather conditions or handling the erratic movements of a jaywalking pedestrian. On the other hand, reinforcement learning (RL) strategies aim to cultivate policies through reward-based learning. However, RL has difficulty bridging the sim-to-real gap and achieving sample efficiency [11].


X-VILA: Cross-Modality Alignment for Large Language Model

arXiv.org Artificial Intelligence

We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, which exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.


Enhance Planning with Physics-informed Safety Controller for End-to-end Autonomous Driving

arXiv.org Artificial Intelligence

Recent years have seen a growing research interest in applications of Deep Neural Networks (DNN) to autonomous vehicle technology. The trend started with perception and prediction a few years ago and is gradually extending to motion planning tasks. Although network performance improves over time, DNN planners inherit the natural drawbacks of Deep Learning. Learning-based planners have limitations in achieving perfect accuracy on the training dataset, and network performance can be affected by out-of-distribution problems. In this paper, we propose FusionAssurance, a novel trajectory-based end-to-end driving fusion framework which incorporates physics-informed control for safety assurance. By incorporating Potential Field into Model Predictive Control, FusionAssurance is capable of navigating through scenarios that are not included in the training dataset and scenarios where neural networks fail to generalize. The effectiveness of the approach is demonstrated by extensive experiments under various scenarios on the CARLA benchmark.


Personalized LoRA for Human-Centered Text Understanding

arXiv.org Artificial Intelligence

Effectively and efficiently adapting a pre-trained language model (PLM) for human-centered text understanding (HCTU) is challenging since user tokens are million-level in most personalized applications and do not have concrete explicit semantics. A standard and parameter-efficient approach (e.g., LoRA) necessitates memorizing numerous suites of adapters, one for each user. In this work, we introduce a personalized LoRA (PLoRA) with a plug-and-play (PnP) framework for the HCTU task. PLoRA is effective, parameter-efficient, and dynamically deployable in PLMs. Moreover, personalized dropout and mutual information maximizing strategies are adopted, so the proposed PLoRA can be well adapted to few/zero-shot learning scenarios for the cold-start issue. Experiments conducted on four benchmark datasets show that the proposed method outperforms existing methods in full/few/zero-shot learning scenarios for the HCTU task, even though it has fewer trainable parameters. For reproducibility, the code for this paper is available at: https://github.com/yoyo-yun/PLoRA.
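For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of the standard LoRA weight update, W + (alpha / r) · B · A, in plain Python. It illustrates ordinary LoRA only and is not the PLoRA method; the helper names `matmul` and `lora_effective_weight` and the toy matrices are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    W is the frozen (d_out x d_in) weight; B is (d_out x r) and A is (r x d_in).
    Only A and B would be trained in standard LoRA.
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: frozen 2x3 weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]      # d_out x r
A = [[0.5, 0.0, 1.0]]   # r x d_in
W_eff = lora_effective_weight(W, B, A, alpha=2.0)
# W_eff == [[2.0, 0.0, 2.0], [2.0, 1.0, 4.0]]
```

Storing a separate (A, B) pair for each user is what makes the plain approach costly when the number of users reaches the millions, which is the storage problem the abstract points to.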