ObitoNet: Multimodal High-Resolution Point Cloud Reconstruction
Apoorv Thapliyal, Vinay Lanka, Swathi Baskaran
Our approach leverages Vision Transformers (ViT) to extract rich semantic features from input images, while a point cloud tokenizer, built on Farthest Point Sampling (FPS) and K-Nearest Neighbors (KNN) grouping, captures local geometric detail. These multimodal features are fused by a learnable Cross-Attention module that enables effective interaction between the two modalities, and a transformer-based decoder then reconstructs high-fidelity point clouds. The model is trained with Chamfer Distance (L1/L2) as the loss function, ensuring precise alignment between reconstructed outputs and the ground truth. Experimental evaluations on standard benchmarks, including ShapeNet, show that ObitoNet achieves performance comparable to state-of-the-art point cloud reconstruction methods.
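To make the named components concrete, the sketch below shows one minimal PyTorch rendering of the pieces the abstract lists: FPS to pick patch centers, KNN grouping to form local patches, a cross-attention block in which point tokens attend to ViT image tokens, and the Chamfer Distance loss in its L1 and L2 variants. All module names, dimensions, and the exact fusion layout here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: shapes, names, and hyperparameters are assumed,
# not taken from the ObitoNet codebase.
import torch
import torch.nn as nn


def farthest_point_sampling(xyz: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.
    xyz: (B, N, 3) -> indices of sampled centers, shape (B, n_samples)."""
    B, N, _ = xyz.shape
    idx = torch.zeros(B, n_samples, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.zeros(B, dtype=torch.long, device=xyz.device)  # seed: point 0
    batch = torch.arange(B, device=xyz.device)
    for i in range(n_samples):
        idx[:, i] = farthest
        center = xyz[batch, farthest].unsqueeze(1)             # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - center) ** 2).sum(-1))
        farthest = dist.argmax(-1)                             # farthest from the set
    return idx


def knn_group(xyz: torch.Tensor, centers: torch.Tensor, k: int) -> torch.Tensor:
    """For each center, gather its k nearest neighbors as a local patch.
    xyz: (B, N, 3), centers: (B, M, 3) -> center-relative patches (B, M, k, 3)."""
    d = torch.cdist(centers, xyz)                              # (B, M, N) pairwise L2
    nn_idx = d.topk(k, dim=-1, largest=False).indices          # (B, M, k)
    B = xyz.shape[0]
    batch = torch.arange(B, device=xyz.device).view(B, 1, 1)
    patches = xyz[batch, nn_idx]                               # (B, M, k, 3)
    return patches - centers.unsqueeze(2)                      # normalize to each center


class CrossAttentionFusion(nn.Module):
    """Point tokens (queries) attend to ViT image tokens (keys/values)."""

    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(point_tokens, image_tokens, image_tokens)
        return self.norm(point_tokens + fused)                 # residual + norm


def chamfer_distance(pred: torch.Tensor, gt: torch.Tensor, p: int = 2) -> torch.Tensor:
    """Symmetric Chamfer Distance between (B, Np, 3) and (B, Ng, 3) clouds.
    p=1 averages nearest-neighbor distances (CD-L1); p=2 averages their squares (CD-L2)."""
    d = torch.cdist(pred, gt)                                  # (B, Np, Ng)
    if p == 1:
        return d.min(-1).values.mean() + d.min(-2).values.mean()
    return (d.min(-1).values ** 2).mean() + (d.min(-2).values ** 2).mean()
```

Using point tokens as queries against image keys/values is one natural reading of "cross-attention between the two modalities": each local geometric patch selects the image evidence relevant to its region, and the decoder then upsamples the fused tokens into a dense cloud scored by the Chamfer loss above.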
arXiv.org Artificial Intelligence
Dec-24-2024