Supplementary Material for Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models
Anonymous Author(s)
Neural Information Processing Systems
The masking strategy is random, and the mask ratio m is 60%.

Embedding: To embed each masked point patch, Point-MAE substitutes it with a learnable, weight-shared mask token. For unmasked (i.e., visible) point patches, Point-MAE employs a lightweight PointNet [8] to extract features. The visible point patches $P_v$ are hence embedded into visible tokens $T_v$:

$T_v = \mathrm{PointNet}(P_v)$ (1)

Backbone: The backbone of Point-MAE is entirely based on standard Transformers, with an asymmetric encoder-decoder design. The encoder takes the visible tokens $T_v$ as input and generates encoded tokens $T_e$. In addition, Point-MAE adds positional embeddings to each Transformer block, thereby providing location-based information.
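The masking and embedding steps described above can be illustrated with a minimal NumPy sketch. It is not the Point-MAE implementation: the patch count `G`, points per patch `K`, token width `C`, the two-layer shared MLP standing in for the lightweight PointNet, and the helper names `random_mask` and `mini_pointnet` are all assumptions made for illustration. Only the random strategy and the 60% ratio come from the text.

```python
import numpy as np


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean array in which True marks a masked patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def mini_pointnet(patches, w1, w2):
    """Hypothetical stand-in for the lightweight PointNet.

    Applies a shared per-point MLP, then max-pools over the points of
    each patch: (n_patches, K, 3) -> (n_patches, C).
    """
    hidden = np.maximum(patches @ w1, 0.0)  # per-point ReLU features
    feats = hidden @ w2                     # per-point output features
    return feats.max(axis=1)                # order-invariant pooling


rng = np.random.default_rng(0)
G, K, C = 64, 32, 128                       # assumed sizes, not from the paper
patches = rng.normal(size=(G, K, 3))        # toy point patches

mask = random_mask(G, mask_ratio=0.6, rng=rng)

# Randomly initialised weights; in a real model these would be trained.
w1 = rng.normal(size=(3, 64)) * 0.1
w2 = rng.normal(size=(64, C)) * 0.1

# Eq. (1): visible patches P_v -> visible tokens T_v.
visible_tokens = mini_pointnet(patches[~mask], w1, w2)

# One shared vector reused for every masked patch (learnable in training).
mask_token = np.zeros(C)
masked_tokens = np.broadcast_to(mask_token, (int(mask.sum()), C))
```

Only `visible_tokens` would be passed to the encoder, which is what makes the encoder-decoder asymmetric. The Transformer blocks and positional embeddings are omitted from this sketch.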
Apr-30-2026, 09:17:13 GMT