Cross-View Consistency



Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization

Neural Information Processing Systems

Recent voxel-based 3D object detectors for autonomous vehicles learn point cloud representations from either the bird's-eye view (BEV) or the range view (RV, a.k.a. the perspective view). However, each view has its own strengths and weaknesses. In this paper, we present a novel framework to unify and leverage the benefits of both BEV and RV. The widely used cuboid-shaped voxels in the Cartesian coordinate system only benefit learning a BEV feature map. Therefore, to enable learning both BEV and RV feature maps, we introduce Hybrid-Cylindrical-Spherical voxelization. Our findings show that simply adding detection on another view as auxiliary supervision leads to poor performance. We propose a pair of cross-view transformers to transform the feature maps into the other view and introduce a cross-view consistency loss on them. Comprehensive experiments on the challenging NuScenes dataset validate the effectiveness of our proposed method by virtue of joint optimization and complementary information on both views. Remarkably, our approach achieves an mAP of 55.8%, outperforming all published approaches by at least 3% in overall performance and up to 16.5% in safety-critical categories such as cyclists.
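The key idea of the voxelization above can be sketched in a few lines. The following is an illustrative assumption only: the paper's exact HCS partition (resolutions, ranges, offsets) is not given in the abstract, so the grid parameters and the specific mix of cylindrical and spherical coordinates here are hypothetical.

```python
import numpy as np

def hcs_voxelize(points, rho_res=0.5, phi_res=np.deg2rad(1.0), theta_res=np.deg2rad(1.0)):
    """Assign Cartesian LiDAR points (N, 3) to hybrid cylindrical-spherical
    voxel indices: cylindrical radius and azimuth serve the BEV-friendly axes,
    while a spherical inclination angle serves the RV-friendly axis.

    NOTE: illustrative sketch only; resolutions and axis choices are assumptions.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x**2 + y**2)            # cylindrical radius (BEV range)
    phi = np.arctan2(y, x)                # azimuth, shared by BEV and RV
    r = np.sqrt(x**2 + y**2 + z**2)       # spherical radius
    # inclination from the +z axis (RV row); guard against r == 0
    theta = np.arccos(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    idx = np.stack([
        np.floor(rho / rho_res),
        np.floor((phi + np.pi) / phi_res),   # shift azimuth to [0, 2*pi)
        np.floor(theta / theta_res),
    ], axis=1).astype(np.int64)
    return idx

pts = np.array([[10.0, 1.0, 1.0], [-5.0, 4.0, -2.0]])
print(hcs_voxelize(pts))
```

Because the azimuth bin is shared by both coordinate systems, features accumulated in such voxels can be projected to a BEV map (rho, phi) and an RV map (phi, theta) without re-binning the points.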


Human Parsing Based Texture Transfer from Single Image to 3D Human via Cross-View Consistency

Neural Information Processing Systems

This paper proposes a human parsing based texture transfer model via cross-view consistency learning to generate the texture of a 3D human body from a single image. We use the semantic parsing of the human body as input, providing both shape and pose information to reduce the appearance variation of the human image and preserve the spatial distribution of semantic parts. Meanwhile, to improve the prediction of textures for invisible parts, we explicitly enforce consistency across different views of the same subject by exchanging the textures predicted from two views to render images during training. A perceptual loss and total variation regularization are optimized to maximize the similarity between rendered and input images, which does not necessitate extra 3D texture supervision. Experimental results on pedestrian images and fashion photos demonstrate that our method produces higher-quality textures with more convincing details than other texture generation methods.


Human Parsing Based Texture Transfer from Single Image to 3D Human via Cross-View Consistency

Neural Information Processing Systems

Firstly, obtaining ground-truth 3D textures is time-consuming and labor-intensive. Secondly, textures of invisible human body parts are difficult to predict because only one image is available and information from other views is lacking at inference.


Geometry-aware 4D Video Generation for Robot Manipulation

Liu, Zeyi, Li, Shuang, Cousineau, Eric, Feng, Siyuan, Burchfiel, Benjamin, Song, Shuran

arXiv.org Artificial Intelligence

Understanding and predicting the dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of videos by supervising the model with cross-view pointmap alignment during training. This geometric supervision enables the model to learn a shared 3D representation of the scene, allowing it to predict future video sequences from novel viewpoints based solely on the given RGB-D observations, without requiring camera poses as inputs. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, supporting robust robot manipulation and generalization to novel camera viewpoints.


Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning

Zhang, Ziyi, Shen, Li, Ye, Deheng, Luo, Yong, Zhao, Huangxuan, Zhang, Lefei

arXiv.org Artificial Intelligence

Text-to-multiview (T2MV) generation, which produces coherent multiview images from a single text prompt, remains computationally intensive, while accelerated T2MV methods using few-step diffusion models often sacrifice image fidelity and view consistency. To address this, we propose a novel reinforcement learning (RL) finetuning framework tailored for few-step T2MV diffusion models to jointly optimize per-view fidelity and cross-view consistency. Specifically, we first reformulate T2MV denoising across all views as a single unified Markov decision process, enabling multiview-aware policy optimization driven by a joint-view reward objective. Next, we introduce ZMV-Sampling, a test-time T2MV sampling technique that adds an inversion-denoising pass to reinforce both viewpoint and text conditioning, resulting in improved T2MV generation at the cost of inference time. To internalize its performance gains into the base sampling policy, we develop MV-ZigAL, a novel policy optimization strategy that uses reward advantages of ZMV-Sampling over standard sampling as learning signals for policy updates. Finally, noting that the joint-view reward objective under-optimizes per-view fidelity but naively optimizing single-view metrics neglects cross-view alignment, we reframe RL finetuning for T2MV diffusion models as a constrained optimization problem that maximizes per-view fidelity subject to an explicit joint-view constraint, thereby enabling more efficient and balanced policy updates. By integrating this constrained optimization paradigm with MV-ZigAL, we establish our complete RL finetuning framework, referred to as MVC-ZigAL, which effectively refines the few-step T2MV diffusion baseline in both fidelity and consistency while preserving its few-step efficiency.


MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention

Wang, Yuhan, Hong, Fangzhou, Yang, Shuai, Jiang, Liming, Wu, Wayne, Loy, Chen Change

arXiv.org Artificial Intelligence

Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024x1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods.


3D Gaussian Inpainting with Depth-Guided Cross-View Consistency

Huang, Sheng-Yu, Chou, Zi-Ting, Wang, Yu-Chiang Frank

arXiv.org Artificial Intelligence

When performing 3D inpainting with novel-view rendering methods such as Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS), achieving texture and geometry consistency across camera views has been a challenge. In this paper, we propose a framework of 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency (3DGIC) for cross-view consistent 3D inpainting. Guided by the rendered depth information from each training view, our 3DGIC exploits background pixels visible across different views to update the inpainting mask, allowing us to refine the 3DGS for inpainting purposes. Through extensive experiments on benchmark datasets, we confirm that our 3DGIC outperforms current state-of-the-art 3D inpainting methods both quantitatively and qualitatively.


Review for NeurIPS paper: Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization

Neural Information Processing Systems

There are many small language mistakes, mostly in the technical section (Section 3), but they are not the main problem. The proposed method is simple (which, again, is a good thing), but somehow it is difficult to understand from the text. I try to detail below what could be changed to improve the clarity of the text:
- Calling the mapping functions used in the constraint term "cross-view transformers" is confusing, as "transformer" means something else in deep learning (transformers in NLP, spatial transformers).
- Section 3.4 (about the transformers) mentions features, while in fact it is the final outputs that are "transformed".
- It is not said explicitly in Section 3.4 that the weights in Eq. (1) are learned.
- Eqs. (3) to (6) seem to use the Euclidean(?) norm, while the authors probably meant some similarity functions.
- Eq. (6) is disconnected from the text.
- Figure 1 is very dense and it is difficult to understand the method from it, while it should be possible to convey the method visually in a simple way.
- Mentioning the Hough transform to explain the method did not make the presentation more intuitive for me.
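The reviewer's remark about Eqs. (3) to (6), that a consistency term is more naturally a similarity function than a raw Euclidean norm, can be illustrated with a minimal sketch. This is not the paper's actual implementation; the shapes and the per-location cosine formulation are assumptions for illustration.

```python
import numpy as np

def cosine_consistency_loss(feat_a, feat_b, eps=1e-8):
    """Penalize disagreement between a feature map transformed from view A
    and the native feature map of view B, using per-location cosine
    similarity over the channel axis instead of a Euclidean norm.
    Shapes (H, W, C) are hypothetical; loss is 0 when views agree exactly.
    """
    num = np.sum(feat_a * feat_b, axis=-1)
    den = np.linalg.norm(feat_a, axis=-1) * np.linalg.norm(feat_b, axis=-1) + eps
    return float(np.mean(1.0 - num / den))

a = np.random.default_rng(0).normal(size=(4, 4, 8))
print(cosine_consistency_loss(a, a))    # identical features -> ~0
print(cosine_consistency_loss(a, -a))   # opposite features  -> ~2
```

Unlike a Euclidean penalty, this term is invariant to per-location feature magnitude, so it compares the direction of the two views' responses rather than their scale.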


Review for NeurIPS paper: Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization

Neural Information Processing Systems

The paper proposes a method for LIDAR-based object detection that exploits cross-view consistency between bird's-eye view and range view point clouds of the scene. The two inputs are fed to separate neural networks trained with a loss function that includes a term that encourages consistency between the two representations. Evaluations demonstrate strong performance compared to baselines on NuScenes. The paper was reviewed by four knowledgeable referees, who read the author response and subsequently discussed the paper. The reviewers agree that the manner in which the method exploits the bird's-eye and range views is interesting and elegant, namely the HCS voxel representation that enables feature extraction for both views and the manner in which the method enforces consistency on the transformed feature representations.