Box Pose and Shape Estimation and Domain Adaptation for Large-Scale Warehouse Automation

Xihang Yu, Rajat Talak, Jingnan Shi, Ulrich Viereck, Igor Gilitschenski, Luca Carlone

arXiv.org Artificial Intelligence 

Modern warehouse automation systems rely on fleets of intelligent robots that generate vast amounts of data, most of which remains unannotated. This paper develops a self-supervised domain adaptation pipeline that leverages real-world, unlabeled data to improve perception models without requiring manual annotations. Our work focuses specifically on estimating the pose and shape of boxes and presents a correct-and-certify pipeline for self-supervised box pose and shape estimation. We extensively evaluate our approach across a range of simulated and real industrial settings, including adaptation to a large-scale real-world dataset of 50,000 images. The self-supervised model significantly outperforms models trained solely in simulation and shows substantial improvements over a zero-shot 3D bounding box estimation baseline.

Keywords: Certifiable models, computer vision, 3D robot vision, object pose estimation, safe perception, self-supervised learning.
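The abstract names a correct-and-certify pipeline but does not spell out its steps, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that certification is a geometric consistency check between an estimated box (rotation `R`, translation `t`, edge lengths `dims`) and observed 3D points, and that only certified estimates are kept as pseudo-labels for self-supervised training. The function names, the thresholds `eps` and `min_inlier_ratio`, and the `predict` and `correct` callables are all hypothetical.

```python
import numpy as np

def surface_distance(points, R, t, dims):
    """Distance from each 3D point to the surface of an oriented box.

    R (3x3) and t (3,) map box coordinates to the sensor frame;
    dims holds the full edge lengths of the box.
    """
    local = (points - t) @ R                  # row-wise R.T @ (p - t)
    half = np.asarray(dims, dtype=float) / 2.0
    q = np.abs(local) - half                  # per-axis offset from the faces
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=1)
    inside = np.minimum(q.max(axis=1), 0.0)   # negative when the point is inside
    return np.abs(outside + inside)

def certify(points, R, t, dims, eps=0.01, min_inlier_ratio=0.9):
    """Hypothetical certificate: enough observed points lie within eps of the box surface."""
    d = surface_distance(points, R, t, dims)
    return bool(np.mean(d < eps) >= min_inlier_ratio)

def select_pseudo_labels(samples, predict, correct):
    """Keep only corrected estimates that pass the certificate.

    predict(points) -> (R, t, dims) stands in for the perception model;
    correct(points, R, t, dims) -> (R, t, dims) stands in for a refinement step.
    Both are assumed placeholders supplied by the caller.
    """
    kept = []
    for points in samples:
        R, t, dims = correct(points, *predict(points))
        if certify(points, R, t, dims):
            kept.append((points, (R, t, dims)))
    return kept
```

In a self-training loop of this kind, the retained pairs would serve as training targets for the next round of model updates, while uncertified estimates are simply discarded rather than trusted.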