CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation

Aditya Sanghi, Hang Chu, Joseph G. Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero

arXiv.org Artificial Intelligence 

While recent progress has been made in text-to-image generation, text-to-shape generation remains a challenging problem due to the unavailability of paired text and shape data at a large scale. We present a simple yet effective method for zero-shot text-to-shape generation based on a two-stage training process, which depends only on an unlabelled shape dataset and a pre-trained image-text network such as CLIP. Our method not only demonstrates promising zero-shot generalization, but also avoids expensive inference-time optimization and can generate multiple shapes for a given text prompt.

Figure 1: CLIP-Forge generates meaningful shapes without using any shape-text pairing labels. Example prompts: "a cuboid sofa", "a round sofa", "an airplane", "a space shuttle", "an suv", "a pickup truck".
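To make the described pipeline concrete, here is a minimal sketch of the zero-shot inference flow the abstract implies: encode the text prompt with a pre-trained CLIP model, then map that embedding to shape latents that a decoder turns into shapes. The `flow` object (a conditional generative model over shape latents), its `latent_dim` attribute and `sample` method, and the `shape_decoder` are hypothetical stand-ins, not the authors' released code; only the `clip` package calls are real APIs.

```python
# Hedged sketch of CLIP-Forge-style zero-shot text-to-shape inference.
# Assumes a trained shape decoder and a conditional flow over shape
# latents exist; both are hypothetical placeholders here.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def text_to_shapes(prompt: str, flow, shape_decoder, num_shapes: int = 3):
    """Generate several candidate shapes for one text prompt."""
    with torch.no_grad():
        tokens = clip.tokenize([prompt]).to(device)
        text_emb = clip_model.encode_text(tokens).float()
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        # Sampling several noise vectors conditioned on the same text
        # embedding yields multiple distinct shapes per prompt, matching
        # the "multiple shapes for a given text" property.
        noise = torch.randn(num_shapes, flow.latent_dim, device=device)
        cond = text_emb.expand(num_shapes, -1)
        shape_latents = flow.sample(noise, condition=cond)  # hypothetical API
        return shape_decoder(shape_latents)  # e.g. voxel or occupancy grids
```

Because generation is a single forward pass through CLIP, the flow, and the decoder, no per-prompt optimization is needed at inference time, which is the efficiency property the abstract highlights.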