ObitoNet: Multimodal High-Resolution Point Cloud Reconstruction
Apoorv Thapliyal, Vinay Lanka, Swathi Baskaran
Our approach leverages Vision Transformers (ViT) to extract rich semantic features from input images, while a point cloud tokenizer, built on Farthest Point Sampling (FPS) and K-Nearest Neighbors (KNN) grouping, captures local geometric detail. These multimodal features are fused by a learnable Cross-Attention module that enables effective interaction between the two modalities, and a transformer-based decoder then reconstructs high-fidelity point clouds. The model is trained with Chamfer Distance (L1/L2) as the loss function, ensuring precise alignment between reconstructed outputs and the ground truth. Experimental evaluations on standard benchmarks, including ShapeNet, show that ObitoNet achieves performance comparable to state-of-the-art point cloud reconstruction methods.
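To make the named components concrete, the sketch below shows one minimal PyTorch rendering of the pieces the abstract lists: FPS to pick patch centers, KNN grouping to form local patches, a cross-attention block in which point tokens attend to ViT image tokens, and the Chamfer Distance loss in its L1 and L2 variants. All module names, dimensions, and the exact fusion layout here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: shapes, names, and hyperparameters are assumed,
# not taken from the ObitoNet codebase.
import torch
import torch.nn as nn


def farthest_point_sampling(xyz: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.
    xyz: (B, N, 3) -> indices of sampled centers, shape (B, n_samples)."""
    B, N, _ = xyz.shape
    idx = torch.zeros(B, n_samples, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.zeros(B, dtype=torch.long, device=xyz.device)  # seed: point 0
    batch = torch.arange(B, device=xyz.device)
    for i in range(n_samples):
        idx[:, i] = farthest
        center = xyz[batch, farthest].unsqueeze(1)             # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - center) ** 2).sum(-1))
        farthest = dist.argmax(-1)                             # farthest from the set
    return idx


def knn_group(xyz: torch.Tensor, centers: torch.Tensor, k: int) -> torch.Tensor:
    """For each center, gather its k nearest neighbors as a local patch.
    xyz: (B, N, 3), centers: (B, M, 3) -> center-relative patches (B, M, k, 3)."""
    d = torch.cdist(centers, xyz)                              # (B, M, N) pairwise L2
    nn_idx = d.topk(k, dim=-1, largest=False).indices          # (B, M, k)
    B = xyz.shape[0]
    batch = torch.arange(B, device=xyz.device).view(B, 1, 1)
    patches = xyz[batch, nn_idx]                               # (B, M, k, 3)
    return patches - centers.unsqueeze(2)                      # normalize to each center


class CrossAttentionFusion(nn.Module):
    """Point tokens (queries) attend to ViT image tokens (keys/values)."""

    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(point_tokens, image_tokens, image_tokens)
        return self.norm(point_tokens + fused)                 # residual + norm


def chamfer_distance(pred: torch.Tensor, gt: torch.Tensor, p: int = 2) -> torch.Tensor:
    """Symmetric Chamfer Distance between (B, Np, 3) and (B, Ng, 3) clouds.
    p=1 averages nearest-neighbor distances (CD-L1); p=2 averages their squares (CD-L2)."""
    d = torch.cdist(pred, gt)                                  # (B, Np, Ng)
    if p == 1:
        return d.min(-1).values.mean() + d.min(-2).values.mean()
    return (d.min(-1).values ** 2).mean() + (d.min(-2).values ** 2).mean()
```

Using point tokens as queries against image keys/values is one natural reading of "cross-attention between the two modalities": each local geometric patch selects the image evidence relevant to its region, and the decoder then upsamples the fused tokens into a dense cloud scored by the Chamfer loss above.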
arXiv.org Artificial Intelligence
Dec-24-2024