Locked-image Text Tuning
Brief Review -- LiT: Zero-Shot Transfer with Locked-image text Tuning
The proposed model significantly outperforms previous state-of-the-art methods on ImageNet zero-shot classification, improving over CLIP and ALIGN by 8.3% and 8.1%, respectively. With a pre-trained image model, the proposed setup converges significantly faster than the standard from-scratch setups reported in the literature, so LiT offers a way to reuse models that are already pre-trained. The ablations show that locking the image tower almost always works best and that a pre-trained image tower helps significantly across the board, whereas a pre-trained text tower yields only marginal gains and locking the text tower does not work well.
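A minimal PyTorch sketch of the core idea (the authors' implementation is not reproduced here; the tiny encoders, hyperparameters, and `lit_step` helper below are toy placeholders): freeze every parameter of a pre-trained image tower and train only the text tower with a CLIP-style contrastive loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the two towers; a real setup would use e.g. a
# pre-trained ViT image encoder and a Transformer text encoder.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
text_encoder = nn.Sequential(nn.Embedding(1000, 128), nn.Flatten(),
                             nn.Linear(16 * 128, 256))

# "Locked-image" tuning: freeze all image-tower parameters.
for p in image_encoder.parameters():
    p.requires_grad = False
image_encoder.eval()

optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)
temperature = 0.07  # learnable in practice; fixed here for brevity

def lit_step(images, token_ids):
    """One contrastive update with the image tower locked."""
    with torch.no_grad():  # no gradients flow into the locked tower
        img_emb = F.normalize(image_encoder(images), dim=-1)
    txt_emb = F.normalize(text_encoder(token_ids), dim=-1)

    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(images.size(0))
    # Symmetric InfoNCE loss over matching image-text pairs.
    loss = 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch: 8 images (3x32x32) paired with 8 captions of 16 tokens.
loss = lit_step(torch.randn(8, 3, 32, 32), torch.randint(0, 1000, (8, 16)))
```

Because the image tower is frozen, its embeddings could even be precomputed once for the whole dataset, which is part of why this setup converges faster and more cheaply than training both towers from scratch.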
Genre:
- Research Report (0.47)
- Overview (0.40)
Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)