Improving Robustness of Foundation Models in Domain Adaptation with Soup-Adapters
arXiv.org Artificial Intelligence
Computer vision has seen tremendous progress due to the emergence of deep learning. Large supervised benchmark datasets such as ImageNet (Deng et al. 2009) have enabled several methodological breakthroughs, including surpassing traditional computer vision methods (Krizhevsky et al. 2012), the introduction of skip connections (He et al. 2016), advanced architectures such as inverted bottlenecks (Sandler et al. 2018), and improved scaling techniques (Koonce and Koonce 2021). A long-standing limitation has been the dependence on such large curated datasets, which are expensive to obtain.

Recently, the paradigm of foundation models has become an attractive alternative, in which a single model is trained on a corpus of data large enough to generalize well across several distinct downstream tasks. One notable vision foundation model is CLIP (Radford et al. 2021), which learns a joint embedding space of images and their corresponding captions. This architecture can naturally perform zero-shot classification by describing visual categories via text prompts. Another popular foundation model is DINOv2 (Oquab et al. 2023), which was trained on a large curated corpus of images to produce robust features. These models can easily be adapted for few-shot learning using k-NN evaluation or prototypical learning (Snell et al. 2017).
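The prototypical-learning adaptation mentioned above can be sketched in a few lines: each class is represented by the mean of its (frozen) foundation-model embeddings, and queries are assigned to the nearest prototype by cosine similarity. This is a minimal NumPy sketch; the function name and the synthetic feature arrays are illustrative, and in practice the embeddings would come from a frozen encoder such as DINOv2 or CLIP.

```python
import numpy as np

def prototype_classify(support_feats, support_labels, query_feats):
    """Prototypical classification over frozen foundation-model features.

    support_feats: (n_support, d) embeddings of the labeled few-shot examples
    support_labels: (n_support,) integer class labels
    query_feats: (n_query, d) embeddings to classify
    Returns the predicted class label for each query.
    """
    # L2-normalize so that dot products equal cosine similarity
    support = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    query = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)

    # One prototype per class: the mean of that class's support embeddings
    classes = np.unique(support_labels)
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in classes]
    )
    prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

    # Assign each query to the class of its most similar prototype
    sims = query @ prototypes.T  # (n_query, n_classes)
    return classes[sims.argmax(axis=1)]

# Toy 2-D "embeddings": two well-separated classes
support = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
queries = np.array([[1.0, 0.2], [0.2, 1.0]])
preds = prototype_classify(support, labels, queries)  # → [0, 1]
```

CLIP's zero-shot classification follows the same similarity structure, except the "prototypes" are text embeddings of prompts such as "a photo of a dog" rather than class means of labeled images.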
Jul-9-2025