Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
