Qiu, Di
CHOSEN: Contrastive Hypothesis Selection for Multi-View Depth Refinement
Qiu, Di, Zhang, Yinda, Beeler, Thabo, Tankovich, Vladimir, Häne, Christian, Fanello, Sean, Rhemann, Christoph, Escolano, Sergio Orts
We propose CHOSEN, a simple yet flexible, robust, and effective multi-view depth refinement framework. It can be employed in any existing multi-view stereo pipeline and generalizes readily to different multi-view capture systems, e.g., different relative camera placements and lenses. Given an initial depth estimate, CHOSEN iteratively re-samples and selects the best hypotheses, automatically adapting to the metric and intrinsic scales determined by the capture system. The key to our approach is the application of contrastive learning in an appropriate solution space, together with a carefully designed hypothesis feature based on which positive and negative hypotheses can be effectively distinguished. Integrated into a simple baseline multi-view stereo pipeline, CHOSEN delivers impressive depth and normal accuracy compared to many current deep-learning-based multi-view stereo pipelines.
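The abstract describes an iterative re-sample-and-select loop trained with a contrastive objective over depth hypotheses. The following is a minimal sketch of that idea, not the authors' implementation: the scorer architecture, the hand-designed hypothesis feature, and the Gaussian re-sampling scheme are all illustrative assumptions.

```python
# Sketch: contrastive hypothesis selection for depth refinement (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypothesisScorer(nn.Module):
    """Scores per-pixel depth hypotheses from a hypothesis feature vector."""
    def __init__(self, feat_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, K, feat_dim) -> logits: (B, K), one score per hypothesis
        return self.mlp(feats).squeeze(-1)

def contrastive_selection_loss(logits, hyps, gt_depth):
    """InfoNCE-style loss: the hypothesis nearest the ground-truth depth is
    the positive; the remaining sampled hypotheses act as negatives."""
    pos_idx = (hyps - gt_depth.unsqueeze(-1)).abs().argmin(dim=-1)  # (B,)
    return F.cross_entropy(logits, pos_idx)

def refine_depth(scorer, feats, hyps, noise_scale=0.05, iters=3):
    """Iteratively keep the best-scoring hypothesis and re-sample around it.
    A real pipeline would recompute hypothesis features from the multi-view
    cost at every iteration; here feats is kept fixed for brevity."""
    best = scorer(feats).argmax(-1, keepdim=True)            # (B, 1)
    depth = hyps.gather(-1, best).squeeze(-1)                # (B,)
    for _ in range(iters - 1):
        hyps = depth.unsqueeze(-1) + noise_scale * torch.randn_like(hyps)
        best = scorer(feats).argmax(-1, keepdim=True)
        depth = hyps.gather(-1, best).squeeze(-1)
    return depth
```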
SWBT: Similarity Weighted Behavior Transformer with the Imperfect Demonstration for Robotic Manipulation
Wu, Kun, Liu, Ning, Zhao, Zhen, Qiu, Di, Li, Jinming, Che, Zhengping, Xu, Zhiyuan, Qiu, Qinru, Tang, Jian
Imitation learning (IL), which aims to learn optimal control policies from expert demonstrations, has been an effective method for robot manipulation tasks. However, previous IL methods either rely solely on expensive expert demonstrations while discarding imperfect ones, or require interaction with the environment to learn from online experience. In the context of robotic manipulation, we address both challenges and propose a novel framework named Similarity Weighted Behavior Transformer (SWBT). SWBT learns effectively from both expert and imperfect demonstrations without any interaction with the environment. We show that easy-to-obtain imperfect demonstrations, exploited through auxiliary objectives such as forward and inverse dynamics prediction, provide fruitful information that significantly enhances the network. To the best of our knowledge, we are the first to integrate imperfect demonstrations into the offline imitation learning setting for robot manipulation tasks. Extensive experiments on the ManiSkill2 benchmark, built on the high-fidelity SAPIEN simulator, and on real-world robotic manipulation tasks demonstrate that the proposed method extracts better features and improves success rates across all tasks. Our code will be released upon acceptance of the paper.
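As a rough illustration of similarity-weighted behavior cloning, here is a minimal sketch: imperfect transitions are down-weighted by their embedding similarity to expert data before entering the BC loss. The encoder, the max-similarity weighting rule, and the temperature are assumptions for illustration; the paper's transformer policy and auxiliary dynamics tasks are omitted.

```python
# Sketch: similarity-weighted behavior cloning (assumed weighting scheme).
import torch
import torch.nn.functional as F

def similarity_weights(imp_emb, exp_emb, temperature=0.1):
    """Weight each imperfect sample by its maximum cosine similarity to any
    expert embedding, squashed to (0, 1)."""
    sim = F.cosine_similarity(imp_emb.unsqueeze(1), exp_emb.unsqueeze(0), dim=-1)
    return torch.sigmoid(sim.max(dim=1).values / temperature)   # (N_imp,)

def weighted_bc_loss(policy, exp_obs, exp_act, imp_obs, imp_act, w):
    """Standard BC on expert data plus similarity-weighted BC on imperfect data."""
    exp_loss = F.mse_loss(policy(exp_obs), exp_act)
    imp_err = ((policy(imp_obs) - imp_act) ** 2).mean(dim=-1)   # per-sample error
    return exp_loss + (w * imp_err).mean()
```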
Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing
Lan, Yushi, Tan, Feitong, Qiu, Di, Xu, Qiangeng, Genova, Kyle, Huang, Zeng, Fanello, Sean, Pandey, Rohit, Funkhouser, Thomas, Loy, Chen Change, Zhang, Yinda
We present a novel framework for generating photorealistic 3D human heads and subsequently manipulating and reposing them with remarkable flexibility. The proposed approach leverages an implicit function representation of 3D human heads, employing 3D Gaussians anchored on a parametric face model. To enhance representational capabilities and encode spatial information, we embed a lightweight tri-plane payload within each Gaussian.
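To make the representation concrete, here is a minimal sketch of 3D Gaussians that each carry a small tri-plane payload queried at local coordinates. The plane resolution, channel count, and the RGB-plus-density decoder are illustrative assumptions, not the paper's implementation, and the anchoring to the parametric face mesh is left out.

```python
# Sketch: per-Gaussian tri-plane payload (assumed sizes and decoder).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianTriplane(nn.Module):
    def __init__(self, num_gaussians: int, res: int = 8, ch: int = 4):
        super().__init__()
        # Per-Gaussian geometry: offset from the mesh anchor and log-scale.
        self.offset = nn.Parameter(torch.zeros(num_gaussians, 3))
        self.log_scale = nn.Parameter(torch.zeros(num_gaussians, 3))
        # Per-Gaussian payload: three axis-aligned feature planes (XY, XZ, YZ).
        self.planes = nn.Parameter(torch.randn(num_gaussians, 3, ch, res, res) * 0.01)
        self.decoder = nn.Linear(3 * ch, 4)  # -> RGB + density

    def query(self, gauss_idx: torch.Tensor, local_xyz: torch.Tensor):
        """Sample the three planes of each selected Gaussian at local
        coordinates in [-1, 1]^3 and decode to color and density."""
        planes = self.planes[gauss_idx]                  # (N, 3, ch, res, res)
        x, y, z = local_xyz.unbind(-1)
        uv = torch.stack([torch.stack([x, y], -1),       # XY plane
                          torch.stack([x, z], -1),       # XZ plane
                          torch.stack([y, z], -1)], 1)   # (N, 3, 2)
        feats = F.grid_sample(
            planes.flatten(0, 1),                        # (N*3, ch, res, res)
            uv.flatten(0, 1).view(-1, 1, 1, 2),          # (N*3, 1, 1, 2)
            align_corners=True,
        ).view(-1, 3 * planes.shape[2])                  # (N, 3*ch)
        return self.decoder(feats)                       # (N, 4): RGB + density
```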
Modal Uncertainty Estimation via Discrete Latent Representation
Qiu, Di, Lui, Lok Ming
Many important problems in the real world do not have unique solutions. It is thus important for machine learning models to be capable of proposing different plausible solutions with meaningful probability measures. In this work we introduce such a deep learning framework, one that learns the one-to-many mappings between inputs and outputs together with faithful uncertainty measures. We call our framework modal uncertainty estimation, since we model the one-to-many mappings as being generated through a set of discrete latent variables, each representing a latent mode hypothesis that explains the corresponding type of input-output relationship. The discrete nature of the latent representation allows us to estimate, for any input, the conditional probability distribution over the outputs very effectively. Both the discrete latent space and its uncertainty estimation are jointly learned during training. We motivate our use of a discrete latent space through the multi-modal posterior collapse problem in current conditional generative models, develop the theoretical background, and extensively validate our method on both synthetic and realistic tasks. Our framework demonstrates significantly more accurate uncertainty estimation than current state-of-the-art methods, and is informative and convenient for practical use.
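As a concrete reading of the idea above, here is a minimal inference-time sketch: each entry of a learned codebook represents one latent mode, a conditional prior over codes gives the probability of each mode for a given input, and a decoder maps each code to one output hypothesis. All network shapes are illustrative assumptions, and the training procedure (learning the codebook and matching the prior to the posterior) is omitted.

```python
# Sketch: modal uncertainty estimation with a discrete latent space (assumed shapes).
import torch
import torch.nn as nn

class ModalUncertaintyModel(nn.Module):
    def __init__(self, in_dim=32, out_dim=32, num_codes=16, code_dim=8):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.prior = nn.Sequential(      # p(c | x): how plausible is each mode
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, num_codes))
        self.decoder = nn.Sequential(    # p(y | x, c): one output per mode
            nn.Linear(in_dim + code_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    @torch.no_grad()
    def predict_modes(self, x, top_k=3):
        """Return the top-k output hypotheses and their probabilities for x."""
        probs = self.prior(x).softmax(-1)                    # (B, num_codes)
        p, idx = probs.topk(top_k, dim=-1)                   # (B, k)
        codes = self.codebook(idx)                           # (B, k, code_dim)
        x_rep = x.unsqueeze(1).expand(-1, top_k, -1)         # (B, k, in_dim)
        preds = self.decoder(torch.cat([x_rep, codes], -1))  # (B, k, out_dim)
        return preds, p
```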