A Generalization Theory for Zero-Shot Prediction
In 2021, OpenAI shocked the world by improving zero-shot classification accuracy on ImageNet from 11.5% to 76.2% with the CLIP series of models (Radford et al., 2021). This event redefined the goal of zero-shot prediction: from producing models that generalize to unseen classes to producing models that generalize to entirely unseen tasks. Two fundamental drivers of CLIP's success were 1) the use of natural language as a medium for representing arbitrary classes (as in the previous state of the art, Visual N-grams (Li et al., 2017)), and 2) a massive, yet carefully designed pre-training set, which significantly impacted downstream performance (Radford et al., 2021; Fang et al., 2023; Xu et al., 2024). Despite the remarkable success of these foundation model-based pipelines (Bommasani et al., 2022), there are unique components of zero-shot prediction that warrant investigation from a theoretical point of view. To clarify these gaps, we contrast zero-shot prediction (ZSP) with the related setting of few-shot learning (FSL). Let x ∈ X denote an input (often an image) that accompanies a discrete value y ∈ Y (often a class label).
Jul-15-2025