Seeing in Words: Learning to Classify through Language Bottlenecks
Khalid Saifullah, Yuxin Wen, Jonas Geiping, Micah Goldblum, Tom Goldstein
arXiv.org Artificial Intelligence
In contrast, humans can explain their predictions using succinct and intuitive descriptions. To incorporate explainability into neural networks, we train a vision model whose feature representations are text. We show that such a model can effectively classify ImageNet images, and we discuss the challenges we encountered when training it.

In recent years, there has been a surge of interest in vision-language models (VLMs) that combine the power of computer vision and natural language processing to perform tasks such as image captioning, visual question answering, and image retrieval (Alayrac et al., 2022; Radford et al., 2021; Li et al., 2022b; Wang et al., 2022; Zeng et al., 2021; Singh et al., 2022). These models leverage both visual and textual signals to reason about their inputs and generate meaningful outputs (Li et al., 2022a; Xu et al., 2015; Anderson et al., 2018; Li et al., 2019; Zhou et al., 2020; Li et al., 2020). One popular approach to building VLMs is self-supervised learning (SSL), in which a model is trained to make predictions about a given input without any human-labeled annotations.
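The core idea of a language bottleneck can be illustrated with a toy sketch: the only information the classifier receives from the vision stage is a text description, so the prediction is explainable by construction. The captioner and text classifier below are hypothetical stand-ins (a keyword-overlap scorer and a lookup table), not the models or training procedure used in the paper.

```python
# Toy sketch of a "language bottleneck" classifier: the vision stage emits
# only text, and the classifier sees only that text. All components here
# are illustrative stand-ins, not the paper's actual models.

def caption_image(image_id):
    # Stand-in for a vision-to-text model; a real system would generate
    # this caption from pixels.
    toy_captions = {
        "img_dog": "a small brown dog running on grass",
        "img_cat": "a cat sleeping on a windowsill",
    }
    return toy_captions[image_id]

def classify_text(caption, class_keywords):
    # Text-only classifier: scores each class by keyword overlap with the
    # caption, so the decision depends solely on the text bottleneck.
    words = set(caption.split())
    scores = {c: len(words & kw) for c, kw in class_keywords.items()}
    return max(scores, key=scores.get)

CLASS_KEYWORDS = {
    "dog": {"dog", "puppy", "running"},
    "cat": {"cat", "kitten", "sleeping"},
}

def predict(image_id):
    caption = caption_image(image_id)  # the human-readable explanation
    return classify_text(caption, CLASS_KEYWORDS)

print(predict("img_dog"))  # dog
print(predict("img_cat"))  # cat
```

Because the caption is the sole interface between the two stages, inspecting it directly explains the prediction, which is the explainability property the abstract describes.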
Jun-28-2023
- Country:
- North America > United States (0.47)
- Genre:
- Research Report (0.83)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.50)
- Natural Language (1.00)
- Vision (1.00)