Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models
