Goto

Collaborating Authors

 Ma, Zhiyuan


Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data

arXiv.org Artificial Intelligence

It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. Aiming at overcoming the data shortage, we propose a novel training scheme, termed as Progressive Rendering Distillation (PRD), eliminating the need for 3D ground-truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each iteration of training, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and in each step it decodes the denoised latent into 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used in joint with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Since PRD supports training without 3D ground-truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD can accelerate the inference speed of the generation model in just a few steps. With PRD, we train a Triplane generator, namely TriplaneTurbo, which adds only $2.5\%$ trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalize well for challenging text input. The code is available at https://github.com/theEricMa/TriplaneTurbo.


Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines

arXiv.org Artificial Intelligence

Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task. Current methods mainly employ separate retrieval and generation modules to acquire external knowledge and generate answers, respectively. We propose ReAuSE, an alternative to the previous RAG model for the knowledge-based VQA task, which seamlessly integrates knowledge retriever into the generative multi-modal large language model, serving as a built-in search engine. Specifically, our model functions both as a generative retriever and an accurate answer generator. It not only helps retrieve documents from the knowledge base by producing identifiers for each document, but it also answers visual questions based on the retrieved documents. Furthermore, we propose a reinforced retrieval calibration module from relevance feedback to improve retrieval performance and align with the preferences for accurate answer generation. Extensive experiments on two representative OKVQA and A-OKVQA datasets demonstrate significant improvements ranging from 2.9\% to 9.6\% across all evaluation metrics when compared to strong baselines.


Gumbel Reranking: Differentiable End-to-End Reranker Optimization

arXiv.org Artificial Intelligence

RAG systems rely on rerankers to identify relevant documents. However, fine-tuning these models remains challenging due to the scarcity of annotated query-document pairs. Existing distillation-based approaches suffer from training-inference misalignment and fail to capture interdependencies among candidate documents. To overcome these limitations, we reframe the reranking process as an attention-mask problem and propose Gumbel Reranking, an end-to-end training framework for rerankers aimed at minimizing the training-inference gap. In our approach, reranker optimization is reformulated as learning a stochastic, document-wise Top-$k$ attention mask using the Gumbel Trick and Relaxed Top-$k$ Sampling. This formulation enables end-to-end optimization by minimizing the overall language loss. Experiments across various settings consistently demonstrate performance gains, including a 10.4\% improvement in recall on HotpotQA for distinguishing indirectly relevant documents.


MVBoost: Boost 3D Reconstruction with Multi-View Refinement

arXiv.org Artificial Intelligence

Recent advancements in 3D object reconstruction have been remarkable, yet most current 3D models rely heavily on existing 3D datasets. The scarcity of diverse 3D datasets results in limited generalization capabilities of 3D reconstruction models. In this paper, we propose a novel framework for boosting 3D reconstruction with multi-view refinement (MVBoost) by generating pseudo-GT data. The key of MVBoost is combining the advantages of the high accuracy of the multi-view generation model and the consistency of the 3D reconstruction model to create a reliable data source. Specifically, given a single-view input image, we employ a multi-view diffusion model to generate multiple views, followed by a large 3D reconstruction model to produce consistent 3D data. MVBoost then adaptively refines these multi-view images, rendered from the consistent 3D data, to build a large-scale multi-view dataset for training a feed-forward 3D reconstruction model. Additionally, the input view optimization is designed to optimize the corresponding viewpoints based on the user's input image, ensuring that the most important viewpoint is accurately tailored to the user's needs. Extensive evaluations demonstrate that our method achieves superior reconstruction results and robust generalization compared to prior works.


DH-VTON: Deep Text-Driven Virtual Try-On via Hybrid Attention Learning

arXiv.org Artificial Intelligence

Virtual Try-ON (VTON) aims to synthesis specific person images dressed in given garments, which recently receives numerous attention in online shopping scenarios. Currently, the core challenges of the VTON task mainly lie in the fine-grained semantic extraction (i.e.,deep semantics) of the given reference garments during depth estimation and effective texture preservation when the garments are synthesized and warped onto human body. To cope with these issues, we propose DH-VTON, a deep text-driven virtual try-on model featuring a special hybrid attention learning strategy and deep garment semantic preservation module. By standing on the shoulder of a well-built pre-trained paint-by-example (abbr. PBE) approach, we present our DH-VTON pipeline in this work. Specifically, to extract the deep semantics of the garments, we first introduce InternViT-6B as fine-grained feature learner, which can be trained to align with the large-scale intrinsic knowledge with deep text semantics (e.g.,"neckline" or "girdle") to make up for the deficiency of the commonly adopted CLIP encoder. Based on this, to enhance the customized dressing abilities, we further introduce Garment-Feature ControlNet Plus (abbr. GFC+) module and propose to leverage a fresh hybrid attention strategy for training, which can adaptively integrate fine-grained characteristics of the garments into the different layers of the VTON model, so as to achieve multi-scale features preservation effects. Extensive experiments on several representative datasets demonstrate that our method outperforms previous diffusion-based and GAN-based approaches, showing competitive performance in preserving garment details and generating authentic human images.


Mirror-Consistency: Harnessing Inconsistency in Majority Voting

arXiv.org Artificial Intelligence

Self-Consistency, a widely-used decoding strategy, significantly boosts the reasoning capabilities of Large Language Models (LLMs). However, it depends on the plurality voting rule, which focuses on the most frequent answer while overlooking all other minority responses. These inconsistent minority views often illuminate areas of uncertainty within the model's generation process. To address this limitation, we present Mirror-Consistency, an enhancement of the standard Self-Consistency approach. Our method incorporates a 'reflective mirror' into the self-ensemble decoding process and enables LLMs to critically examine inconsistencies among multiple generations. Additionally, just as humans use the mirror to better understand themselves, we propose using Mirror-Consistency to enhance the sample-based confidence calibration methods, which helps to mitigate issues of overconfidence. Our experimental results demonstrate that Mirror-Consistency yields superior performance in both reasoning accuracy and confidence calibration compared to Self-Consistency.


ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

arXiv.org Artificial Intelligence

By leveraging the text-to-image diffusion priors, score distillation can synthesize 3D contents without paired text-3D training data. Instead of spending hours of online optimization per text prompt, recent studies have been focused on learning a text-to-3D generative network for amortizing multiple text-3D relations, which can synthesize 3D contents in seconds. However, existing score distillation methods are hard to scale up to a large amount of text prompts due to the difficulties in aligning pretrained diffusion prior with the distribution of rendered images from various text prompts. Current state-of-the-arts such as Variational Score Distillation finetune the pretrained diffusion model to minimize the noise prediction error so as to align the distributions, which are however unstable to train and will impair the model's comprehension capability to numerous text prompts. Based on the observation that the diffusion models tend to have lower noise prediction errors at earlier timesteps, we propose Asynchronous Score Distillation (ASD), which minimizes the noise prediction error by shifting the diffusion timestep to earlier ones. ASD is stable to train and can scale up to 100k prompts. It reduces the noise prediction error without changing the weights of pre-trained diffusion model, thus keeping its strong comprehension capability to prompts. We conduct extensive experiments across different 2D diffusion models, including Stable Diffusion and MVDream, and text-to-3D generators, including Hyper-iNGP, 3DConv-Net and Triplane-Transformer. The results demonstrate ASD's effectiveness in stable 3D generator training, high-quality 3D content synthesis, and its superior prompt-consistency, especially under large prompt corpus.


Neural Residual Diffusion Models for Deep Scalable Vision Generation

arXiv.org Artificial Intelligence

The most advanced diffusion models have recently adopted increasingly deep stacked networks (e.g., U-Net or Transformer) to promote the generative emergence capabilities of vision generation models similar to large language models (LLMs). However, progressively deeper stacked networks will intuitively cause numerical propagation errors and reduce noisy prediction capabilities on generative data, which hinders massively deep scalable training of vision generation models. In this paper, we first uncover the nature that neural networks being able to effectively perform generative denoising lies in the fact that the intrinsic residual unit has consistent dynamic property with the input signal's reverse diffusion process, thus supporting excellent generative abilities. Afterwards, we stand on the shoulders of two common types of deep stacked networks to propose a unified and massively scalable Neural Residual Diffusion Models framework (Neural-RDM for short), which is a simple yet meaningful change to the common architecture of deep generative networks by introducing a series of learnable gated residual parameters that conform to the generative dynamics. Experimental results on various generative tasks show that the proposed neural residual models obtain state-of-the-art scores on image's and video's generative benchmarks. Rigorous theoretical proofs and extensive experiments also demonstrate the advantages of this simple gated residual mechanism consistent with dynamic modeling in improving the fidelity and consistency of generated content and supporting large-scale scalable training.


Exploring Adversarial Robustness of Deep State Space Models

arXiv.org Artificial Intelligence

Deep State Space Models (SSMs) have proven effective in numerous task scenarios but face significant security challenges due to Adversarial Perturbations (APs) in real-world deployments. Adversarial Training (AT) is a mainstream approach to enhancing Adversarial Robustness (AR) and has been validated on various traditional DNN architectures. However, its effectiveness in improving the AR of SSMs remains unclear. While many enhancements in SSM components, such as integrating Attention mechanisms and expanding to data-dependent SSM parameterizations, have brought significant gains in Standard Training (ST) settings, their potential benefits in AT remain unexplored. To investigate this, we evaluate existing structural variants of SSMs with AT to assess their AR performance. We observe that pure SSM structures struggle to benefit from AT, whereas incorporating Attention yields a markedly better trade-off between robustness and generalization for SSMs in AT compared to other components. Nonetheless, the integration of Attention also leads to Robust Overfitting (RO) issues. To understand these phenomena, we empirically and theoretically analyze the output error of SSMs under AP. We find that fixed-parameterized SSMs have output error bounds strictly related to their parameters, limiting their AT benefits, while input-dependent SSMs may face the problem of error explosion. Furthermore, we show that the Attention component effectively scales the output error of SSMs during training, enabling them to benefit more from AT, but at the cost of introducing RO due to its high model complexity. Inspired by this, we propose a simple and effective Adaptive Scaling (AdS) mechanism that brings AT performance close to Attention-integrated SSMs without introducing the issue of RO.


UltraMedical: Building Specialized Generalists in Biomedicine

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains and are moving towards more specialized areas. Recent advanced proprietary models such as GPT-4 and Gemini have achieved significant advancements in biomedicine, which have also raised privacy and security challenges. The construction of specialized generalists hinges largely on high-quality datasets, enhanced by techniques like supervised fine-tuning and reinforcement learning from human or AI feedback, and direct preference optimization. However, these leading technologies (e.g., preference learning) are still significantly limited in the open source community due to the scarcity of specialized data. In this paper, we present the UltraMedical collections, which consist of high-quality manual and synthetic datasets in the biomedicine domain, featuring preference annotations across multiple advanced LLMs. By utilizing these datasets, we fine-tune a suite of specialized medical models based on Llama-3 series, demonstrating breathtaking capabilities across various medical benchmarks. Moreover, we develop powerful reward models skilled in biomedical and general reward benchmark, enhancing further online preference learning within the biomedical LLM community.