ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

Neural Information Processing Systems 

A vision model with general-purpose object-level 3D understanding should be capable of inferring both 2D information (e.g., class name and bounding box) and 3D information (e.g., 3D location and 3D viewpoint) for arbitrary rigid objects in natural images. This is a challenging task, as it involves inferring 3D information from 2D signals and, most importantly, generalizing to rigid objects from unseen categories. However, existing datasets with object-level 3D annotations are often limited by the number of categories or the quality of annotations. Models developed on these datasets become specialists for certain categories or domains, and fail to generalize.