Context and Geometry Aware Voxel Transformer for Semantic Scene Completion
Neural Information Processing Systems
Vision-based Semantic Scene Completion (SSC) has gained much attention due to its widespread applications in various 3D perception tasks. Existing sparse-to-dense approaches typically employ shared, context-independent queries across different input images, which fails to capture distinctions among them, since the focal regions of different inputs vary; this can result in undirected feature aggregation during cross-attention. Additionally, the absence of depth information means that 3D points projected onto the image plane may share the same 2D position or similar sampling points in the feature map, resulting in depth ambiguity. In this paper, we present a novel context and geometry aware voxel transformer. It utilizes a context-aware query generator to initialize context-dependent queries tailored to individual input images, effectively capturing their unique characteristics and aggregating information within the regions of interest. Furthermore, it extends deformable cross-attention from 2D to 3D pixel space, enabling points that share similar image coordinates to be distinguished by their depth coordinates.
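The depth-ambiguity point can be made concrete with a small sketch. The idea is that when sampling is lifted from the 2D image plane to a 3D (depth, height, width) feature space, two voxel queries that project to the same 2D pixel but lie at different depths receive different features. The following is a minimal illustration of that principle using plain trilinear sampling in NumPy; it is not the paper's implementation, and the function name and volume layout are assumptions for this sketch (the actual method uses learned deformable offsets and attention weights).

```python
import numpy as np

def trilinear_sample(volume, u, v, d):
    """Sample a (D, H, W, C) feature volume at a continuous
    (u, v, d) = (column, row, depth-bin) coordinate.

    Hypothetical helper for illustration: in 2D deformable attention
    only (u, v) would be used, so points differing solely in d would
    collapse to the same feature; adding d resolves that ambiguity.
    """
    D, H, W, C = volume.shape
    u0, v0, d0 = int(np.floor(u)), int(np.floor(v)), int(np.floor(d))
    u1, v1, d1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1), min(d0 + 1, D - 1)
    fu, fv, fd = u - u0, v - v0, d - d0
    out = np.zeros(C)
    # Accumulate the 8 corner features with trilinear weights.
    for di, wd in ((d0, 1.0 - fd), (d1, fd)):
        for vi, wv in ((v0, 1.0 - fv), (v1, fv)):
            for ui, wu in ((u0, 1.0 - fu), (u1, fu)):
                out += wd * wv * wu * volume[di, vi, ui]
    return out

# Toy volume whose features equal their depth index, so depth
# differences are directly visible in the sampled values.
volume = np.zeros((4, 3, 3, 2))
for depth in range(4):
    volume[depth] = float(depth)

near = trilinear_sample(volume, 1.0, 1.0, 0.5)  # same (u, v) ...
far = trilinear_sample(volume, 1.0, 1.0, 2.5)   # ... different depth
```

Here `near` and `far` share the 2D coordinate (1.0, 1.0) but differ because sampling is performed in 3D; a purely 2D sampler would return identical features for both.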