Collaborating Authors

 Pan, Yulin


ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

arXiv.org Artificial Intelligence

Diffusion models have emerged as a powerful generative technology and have been found to be applicable in various scenarios. Most existing foundational diffusion models are primarily designed for text-guided visual generation and do not support multi-modal conditions, which are essential for many visual editing tasks. This limitation prevents these foundational diffusion models from serving as a unified model in the field of visual generation, like GPT-4 in the natural language processing field. In this work, we propose ACE, an All-round Creator and Editor, which achieves performance comparable to that of expert models in a wide range of visual generation tasks. To achieve this goal, we first introduce a unified condition format termed Long-context Condition Unit (LCU), and propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks. Furthermore, we propose an efficient data collection approach to address the lack of available training data. It involves acquiring pairwise images with synthesis-based or clustering-based pipelines and supplying these pairs with accurate textual instructions by leveraging a fine-tuned multi-modal large language model. To comprehensively evaluate the performance of our model, we establish a benchmark of manually annotated pair data across a variety of visual generation tasks. The extensive experimental results demonstrate the superiority of our model in visual generation fields. Thanks to the all-in-one capabilities of our model, we can easily build a multi-modal chat system that responds to any interactive request for image creation using a single model to serve as the backend, avoiding the cumbersome pipeline typically employed in visual agents. Code and models will be available on the project page: https://ali-vilab.github.io/ace-page/.


A generalized likelihood-weighted optimal sampling algorithm for rare-event probability quantification

arXiv.org Artificial Intelligence

In this work, we introduce a new acquisition function for sequential sampling to efficiently quantify rare-event statistics of an input-to-response (ItR) system with given input probability and expensive function evaluations. Our acquisition is a generalization of the likelihood-weighted (LW) acquisition that was initially designed for the same purpose and then extended to many other applications. The improvement in our acquisition comes from the generalized form with two additional parameters, which can be varied to target and address two weaknesses of the original LW acquisition: (1) that the input space associated with rare-event responses is not sufficiently stressed in sampling; (2) that the surrogate model (generated from samples) may deviate significantly from the true ItR function, especially for cases with a complex ItR function and a limited number of samples. In addition, we develop a critical procedure in Monte-Carlo discrete optimization of the acquisition function, which achieves orders of magnitude acceleration compared to existing approaches for this type of problem. The superior performance of our new acquisition over the original LW acquisition is demonstrated in a number of test cases, including some cases that were designed to show the effectiveness of the original LW acquisition. We finally apply our method to an engineering example to quantify the rare-event roll-motion statistics of a ship in a random sea.
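To make the idea of a likelihood-weighted acquisition concrete, the following is a minimal sketch and not the authors' implementation. It assumes the commonly cited form of the original LW acquisition, a(x) = var(x) * p_x(x) / p_mu(mu(x)), where mu and var are the surrogate mean and variance and p_mu is the density of the surrogate mean response; the abstract does not state this formula, and the two additional parameters of the generalized form are not shown. The function name `lw_acquisition`, the histogram density estimate, and the toy cubic response are all illustrative assumptions.

```python
import numpy as np

def lw_acquisition(mean, var, p_x, bins=50):
    """Score candidates by var * p_x / p_mu (assumed LW form).

    mean, var: surrogate mean and variance at each candidate point.
    p_x: input probability density at each candidate point.
    p_mu is estimated with a weighted histogram of the surrogate mean,
    which assumes the candidates lie on a uniform grid.
    """
    hist, edges = np.histogram(mean, bins=bins, weights=p_x, density=True)
    # Map every candidate to the histogram bin its mean response falls in.
    idx = np.clip(np.digitize(mean, edges) - 1, 0, bins - 1)
    p_mu = np.maximum(hist[idx], 1e-12)  # guard against division by zero
    return var * p_x / p_mu

# Toy setup (hypothetical): standard normal input, cubic surrogate mean,
# constant surrogate variance.
x = np.linspace(-4.0, 4.0, 401)
p_x = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
mean = x**3
var = np.ones_like(x)

scores = lw_acquisition(mean, var, p_x)
x_next = x[np.argmax(scores)]
```

In this toy case the unweighted score var * p_x peaks at the most probable input (x = 0), while the likelihood ratio shifts the preferred sample toward inputs whose responses are rare.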


An adaptive multi-fidelity sampling framework for safety analysis of connected and automated vehicles

arXiv.org Artificial Intelligence

Testing and evaluation are expensive but critical steps in the development of connected and automated vehicles (CAVs). In this paper, we develop an adaptive sampling framework to efficiently evaluate the accident rate of CAVs, particularly for scenario-based tests where the probability distribution of input parameters is known from the Naturalistic Driving Data. Our framework relies on a surrogate model to approximate the CAV performance and a novel acquisition function, formulated through an information-theoretic consideration, to maximize the benefit (information about the accident rate) of the next sample. In addition to the standard application with only a single high-fidelity model of CAV performance, we also extend our approach to the bi-fidelity context where an additional low-fidelity model can be used at a lower computational cost to approximate the CAV performance. Accordingly, for the second case, our approach is formulated such that it allows the choice of the next sample in terms of both fidelity level (i.e., which model to use) and sampling location to maximize the benefit per cost. Our framework is tested in a widely considered two-dimensional cut-in problem for CAVs, where the Intelligent Driving Model (IDM) is run at different time resolutions to construct the high- and low-fidelity models. We show that our single-fidelity method outperforms the existing approach for the same problem, and the bi-fidelity method can further save half of the computational cost to reach a similar accuracy in estimating the accident rate.
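The benefit-per-cost selection described above can be illustrated with a small sketch. It only shows the selection step: the benefit values would come from the acquisition function, which the abstract does not specify, so here they are made-up numbers. The function name `select_next_sample`, the fidelity labels, and the costs are hypothetical.

```python
import numpy as np

def select_next_sample(benefit_by_fidelity, cost_by_fidelity):
    """Return (fidelity, candidate index, benefit per cost) of the best choice.

    benefit_by_fidelity: dict mapping a fidelity name to an array of
        acquisition values, one per candidate sampling location.
    cost_by_fidelity: dict mapping a fidelity name to the cost of one run.
    """
    best = None
    for name, benefit in benefit_by_fidelity.items():
        ratio = np.asarray(benefit, dtype=float) / cost_by_fidelity[name]
        i = int(np.argmax(ratio))
        if best is None or ratio[i] > best[2]:
            best = (name, i, float(ratio[i]))
    return best

# Made-up benefits at three candidate locations for each model.
benefits = {"high": [0.2, 0.9, 0.4], "low": [0.1, 0.3, 0.25]}
costs = {"high": 1.0, "low": 0.25}

fidelity, index, value = select_next_sample(benefits, costs)
```

With these numbers the low-fidelity model is chosen even though its raw benefit is smaller, because it is four times cheaper per evaluation.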


VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval

arXiv.org Artificial Intelligence

Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only brings a huge computational burden with many more parameters, but also leads to forgetting of the knowledge in upstream models. In this work, we propose VoP: Text-Video Co-operative Prompt Tuning for efficient tuning on the text-video retrieval task. The proposed VoP is an end-to-end framework that introduces both video and text prompts, and it can be regarded as a powerful baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, we develop three novel video prompt mechanisms to improve the performance with different scales of trainable parameters. The basic idea of the VoP enhancement is to model the frame position, frame context, and layer function with specific trainable prompts, respectively. Extensive experiments show that compared to full fine-tuning, the enhanced VoP achieves a 1.4% average R@1 gain across five text-video retrieval benchmarks with 6x less parameter overhead. The code will be available at https://github.com/bighuang624/VoP.
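For readers unfamiliar with the R@1 metric quoted above, here is a minimal sketch of how recall at k is typically computed for text-to-video retrieval. It assumes a square similarity matrix in which row i is a text query and column i is its ground-truth video; this layout and the function name `recall_at_k` are assumptions for illustration and are not taken from the VoP code.

```python
import numpy as np

def recall_at_k(sim, k=1):
    """Fraction of text queries whose ground-truth video ranks in the top k.

    sim[i, j] is the similarity between text query i and video j;
    the ground-truth match for query i is assumed to be video i.
    """
    sim = np.asarray(sim, dtype=float)
    gt = np.diag(sim)[:, None]
    # Rank = number of videos scored strictly higher than the ground truth.
    ranks = (sim > gt).sum(axis=1)
    return float((ranks < k).mean())

# Toy similarity matrix: queries 0 and 2 retrieve their own video first,
# query 1 ranks its own video last.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, k=1)
r3 = recall_at_k(sim, k=3)
```

A reported gain in R@1 means that a larger share of queries place the correct video at rank one under this kind of evaluation.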