Goto

Collaborating Authors

 Lin, Guosheng


InsTex: Indoor Scenes Stylized Texture Synthesis

arXiv.org Artificial Intelligence

Generating high-quality textures for 3D scenes is crucial for applications in interior design, gaming, and augmented/virtual reality (AR/VR). Although recent advancements in 3D generative models have enhanced content creation, significant challenges remain in achieving broad generalization and maintaining style consistency across multiple viewpoints. Current methods, such as 2D diffusion models adapted for 3D texturing, suffer from lengthy processing times and visual artifacts, while approaches driven by 3D data often fail to generalize effectively. To overcome these challenges, we introduce InsTex, a two-stage architecture designed to generate high-quality, style-consistent textures for 3D indoor scenes. InsTex utilizes depth-to-image diffusion priors in a coarse-to-fine pipeline, first generating multi-view images with a pre-trained 2D diffusion model and subsequently refining the textures for consistency. Our method supports both textual and visual prompts, achieving state-of-the-art results in visual quality and quantitative metrics, and demonstrates its effectiveness across various 3D texturing applications.


Learning Dynamic Weight Adjustment for Spatial-Temporal Trajectory Planning in Crowd Navigation

arXiv.org Artificial Intelligence

Robot navigation in dense human crowds poses a significant challenge due to the complexity of human behavior in dynamic and obstacle-rich environments. In this work, we propose a dynamic weight adjustment scheme using a neural network to predict the optimal weights of objectives in an optimization-based motion planner. We adopt a spatial-temporal trajectory planner and incorporate diverse objectives to achieve a balance among safety, efficiency, and goal achievement in complex and dynamic environments. We design the network structure, observation encoding, and reward function to effectively train the policy network using reinforcement learning, allowing the robot to adapt its behavior in real time based on environmental and pedestrian information. Simulation results show improved safety compared to the fixed-weight planner and the state-of-the-art learning-based methods, and verify the ability of the learned policy to adaptively adjust the weights based on the observed situations. The approach's feasibility is demonstrated in a navigation task using an autonomous delivery robot across a crowded corridor over a 300 m distance.


Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion

arXiv.org Artificial Intelligence

Lifelong few-shot customization for text-to-image diffusion aims to continually generalize existing models for new tasks with minimal data while preserving old knowledge. Current customization diffusion models excel in few-shot tasks but struggle with catastrophic forgetting problems in lifelong generations. In this study, we identify and categorize the catastrophic forgetting problems into two folds: relevant concepts forgetting and previous concepts forgetting. To address these challenges, we first devise a data-free knowledge distillation strategy to tackle relevant concepts forgetting. Unlike existing methods that rely on additional real data or offline replay of original concept data, our approach enables on-the-fly knowledge distillation to retain the previous concepts while learning new ones, without accessing any previous data. Second, we develop an In-Context Generation (ICGen) paradigm that allows the diffusion model to be conditioned upon the input vision context, which facilitates the few-shot generation and mitigates the issue of previous concepts forgetting. Extensive experiments show that the proposed Lifelong Few-Shot Diffusion (LFS-Diffusion) method can produce high-quality and accurate images while maintaining previously learned knowledge.


MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers

arXiv.org Artificial Intelligence

Recently, 3D assets created via reconstruction and generation have matched the quality of manually crafted assets, highlighting their potential for replacement. However, this potential is largely unrealized because these assets always need to be converted to meshes for 3D industry applications, and the meshes produced by current mesh extraction methods are significantly inferior to Artist-Created Meshes (AMs), i.e., meshes created by human artists. Specifically, current mesh extraction methods rely on dense faces and ignore geometric features, leading to inefficiencies, complicated post-processing, and lower representation quality. To address these issues, we introduce MeshAnything, a model that treats mesh extraction as a generation problem, producing AMs aligned with specified shapes. By converting 3D assets in any 3D representation into AMs, MeshAnything can be integrated with various 3D asset production methods, thereby enhancing their application across the 3D industry. The architecture of MeshAnything comprises a VQ-VAE and a shape-conditioned decoder-only transformer. We first learn a mesh vocabulary using the VQ-VAE, then train the shape-conditioned decoder-only transformer on this vocabulary for shape-conditioned autoregressive mesh generation. Our extensive experiments show that our method generates AMs with hundreds of times fewer faces, significantly improving storage, rendering, and simulation efficiencies, while achieving precision comparable to previous methods.


Magic-Boost: Boost 3D Generation with Mutli-View Conditioned Diffusion

arXiv.org Artificial Intelligence

Benefiting from the rapid development of 2D diffusion models, 3D content creation has made significant progress recently. One promising solution involves the fine-tuning of pre-trained 2D diffusion models to harness their capacity for producing multi-view images, which are then lifted into accurate 3D models via methods like fast-NeRFs or large reconstruction models. However, as inconsistency still exists and limited generated resolution, the generation results of such methods still lack intricate textures and complex geometries. To solve this problem, we propose Magic-Boost, a multi-view conditioned diffusion model that significantly refines coarse generative results through a brief period of SDS optimization ($\sim15$min). Compared to the previous text or single image based diffusion models, Magic-Boost exhibits a robust capability to generate images with high consistency from pseudo synthesized multi-view images. It provides precise SDS guidance that well aligns with the identity of the input images, enriching the local detail in both geometry and texture of the initial generative results. Extensive experiments show Magic-Boost greatly enhances the coarse inputs and generates high-quality 3D assets with rich geometric and textural details. (Project Page: https://magic-research.github.io/magic-boost/)


Fine Structure-Aware Sampling: A New Sampling Training Scheme for Pixel-Aligned Implicit Models in Single-View Human Reconstruction

arXiv.org Artificial Intelligence

Pixel-aligned implicit models, such as PIFu, PIFuHD, and ICON, are used for single-view clothed human reconstruction. These models need to be trained using a sampling training scheme. Existing sampling training schemes either fail to capture thin surfaces (e.g. ears, fingers) or cause noisy artefacts in reconstructed meshes. To address these problems, we introduce Fine Structured-Aware Sampling (FSS), a new sampling training scheme to train pixel-aligned implicit models for single-view human reconstruction. FSS resolves the aforementioned problems by proactively adapting to the thickness and complexity of surfaces. In addition, unlike existing sampling training schemes, FSS shows how normals of sample points can be capitalized in the training process to improve results. Lastly, to further improve the training process, FSS proposes a mesh thickness loss signal for pixel-aligned implicit models. It becomes computationally feasible to introduce this loss once a slight reworking of the pixel-aligned implicit function framework is carried out. Our results show that our methods significantly outperform SOTA methods qualitatively and quantitatively. Our code is publicly available at https://github.com/kcyt/FSS.


StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

arXiv.org Artificial Intelligence

The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models to yield a diverse and controllable dataset with varied image content. Additionally, datasets can be arbitrarily scaled. This not only provides greater flexibility compared to existing methodologies but also significantly enhances several model capabilities. Our research includes comprehensive experiments conducted on various datasets. The results emphasize substantial enhancements in more than ten commonly assessed capabilities. Additionally, our model achieves state-of-the-art results across multiple widely recognized multimodal benchmarks.


Learn to Optimize Denoising Scores for 3D Generation: A Unified and Improved Diffusion Prior on NeRF and 3D Gaussian Splatting

arXiv.org Artificial Intelligence

We propose a unified framework aimed at enhancing the diffusion priors for 3D generation tasks. Despite the critical importance of these tasks, existing methodologies often struggle to generate high-caliber results. We begin by examining the inherent limitations in previous diffusion priors. We identify a divergence between the diffusion priors and the training procedures of diffusion models that substantially impairs the quality of 3D generation. To address this issue, we propose a novel, unified framework that iteratively optimizes both the 3D model and the diffusion prior. Leveraging the different learnable parameters of the diffusion prior, our approach offers multiple configurations, affording various trade-offs between performance and implementation complexity. Notably, our experimental results demonstrate that our method markedly surpasses existing techniques, establishing new state-of-the-art in the realm of text-to-3D generation. Furthermore, our approach exhibits impressive performance on both NeRF and the newly introduced 3D Gaussian Splatting backbones. Additionally, our framework yields insightful contributions to the understanding of recent score distillation methods, such as the VSD and DDS loss.


Neural Vector Fields: Generalizing Distance Vector Fields by Codebooks and Zero-Curl Regularization

arXiv.org Artificial Intelligence

Recent neural networks based surface reconstruction can be roughly divided into two categories, one warping templates explicitly and the other representing 3D surfaces implicitly. To enjoy the advantages of both, we propose a novel 3D representation, Neural Vector Fields (NVF), which adopts the explicit learning process to manipulate meshes and implicit unsigned distance function (UDF) representation to break the barriers in resolution and topology. This is achieved by directly predicting the displacements from surface queries and modeling shapes as Vector Fields, rather than relying on network differentiation to obtain direction fields as most existing UDF-based methods do. In this way, our approach is capable of encoding both the distance and the direction fields so that the calculation of direction fields is differentiation-free, circumventing the non-trivial surface extraction step. Furthermore, building upon NVFs, we propose to incorporate two types of shape codebooks, \ie, NVFs (Lite or Ultra), to promote cross-category reconstruction through encoding cross-object priors. Moreover, we propose a new regularization based on analyzing the zero-curl property of NVFs, and implement this through the fully differentiable framework of our NVF (ultra). We evaluate both NVFs on four surface reconstruction scenarios, including watertight vs non-watertight shapes, category-agnostic reconstruction vs category-unseen reconstruction, category-specific, and cross-domain reconstruction.


IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

arXiv.org Artificial Intelligence

Recent strides in Text-to-3D techniques have been propelled by distilling knowledge from powerful large text-to-image diffusion models (LDMs). Nonetheless, existing Text-to-3D approaches often grapple with challenges such as over-saturation, inadequate detailing, and unrealistic outputs. This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images based on the renderings of coarse 3D models. Although the generated images mostly alleviate the aforementioned issues, challenges such as view inconsistency and significant content variance persist due to the inherent generative nature of large diffusion models, posing extensive difficulties in leveraging these images effectively. To overcome this hurdle, we advocate integrating a discriminator alongside a novel Diffusion-GAN dual training strategy to guide the training of 3D models. For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data. We conduct a comprehensive set of experiments that demonstrate the effectiveness of our method over baseline approaches.