Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
