Label-Efficient Grasp Joint Prediction with Point-JEPA

Jed Guzelkabaagac, Boris Petrović

arXiv.org Artificial Intelligence 

Abstract--We study whether 3D self-supervised pretraining with Point-JEPA enables label-efficient grasp joint-angle prediction. Meshes are sampled to point clouds and tokenized; a ShapeNet-pretrained Point-JEPA encoder feeds a K=5 multi-hypothesis head trained with winner-takes-all and evaluated by top-logit selection. On a multi-finger hand dataset with strict object-level splits, Point-JEPA improves top-logit RMSE and Coverage@15 in low-label regimes (e.g., 26% lower RMSE at 25% data) and reaches parity at full supervision, suggesting JEPA-style pretraining is a practical lever for data-efficient grasp learning.

Self-supervised learning (SSL) for 3D data has largely progressed along three directions. On point clouds this includes point/voxel masked autoencoding; e.g., Voxel-MAE reconstructs masked voxels for sparse automotive LiDAR and improves downstream tasks with fewer labels [1]-[4].
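The training/evaluation split described above (winner-takes-all over K hypotheses during training, top-logit selection at test time) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function name, array shapes, and the use of per-joint MSE are assumptions.

```python
import numpy as np

def wta_step(preds, logits, target):
    """Illustrative winner-takes-all step over K grasp hypotheses.

    preds:  (K, J) predicted joint angles: K hypotheses, J joints
    logits: (K,)   confidence scores, one per hypothesis
    target: (J,)   ground-truth joint angles

    Training uses only the winner's error (gradients would flow to the
    best hypothesis); evaluation selects the hypothesis with the
    highest confidence logit, without access to the target.
    """
    per_hyp_mse = np.mean((preds - target) ** 2, axis=1)  # (K,)
    winner = int(np.argmin(per_hyp_mse))   # training: closest hypothesis
    top_logit = int(np.argmax(logits))     # evaluation: most confident
    return per_hyp_mse[winner], winner, top_logit

# Toy example with K=3 hypotheses over J=2 joints (hypothetical values).
preds = np.array([[0.1, 0.2], [1.0, 1.0], [0.0, 0.0]])
logits = np.array([0.2, 0.5, 0.1])
loss, winner, chosen = wta_step(preds, logits, np.array([0.0, 0.0]))
# Note that winner (index 2) and chosen (index 1) can disagree: the
# gap between them is what metrics like Coverage@15 vs. top-logit
# RMSE are designed to expose.
```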