Seeing in Words: Learning to Classify through Language Bottlenecks
Khalid Saifullah, Yuxin Wen, Jonas Geiping, Micah Goldblum, Tom Goldstein
arXiv.org Artificial Intelligence
In contrast, humans can explain their predictions using succinct and intuitive descriptions. To incorporate explainability into neural networks, we train a vision model whose feature representations are text. We show that such a model can effectively classify ImageNet images, and we discuss the challenges we encountered when training it.

In recent years, there has been a surge of interest in vision-language models (VLMs) that combine the power of computer vision and natural language processing to perform tasks such as image captioning, visual question answering, and image retrieval (Alayrac et al., 2022; Radford et al., 2021; Li et al., 2022b; Wang et al., 2022; Zeng et al., 2021; Singh et al., 2022). These models leverage both visual and textual signals to reason about their inputs and generate meaningful outputs (Li et al., 2022a; Xu et al., 2015; Anderson et al., 2018; Li et al., 2019; Zhou et al., 2020; Li et al., 2020). One popular approach to building VLMs is self-supervised learning (SSL), in which a model is trained to make predictions about a given input without any human-labeled annotations.
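The core idea of a language bottleneck can be illustrated with a toy sketch: the only information the classifier receives from the vision stage is a text description, so the prediction is explainable by construction. The captioner and text classifier below are hypothetical stand-ins (a keyword-overlap scorer and a lookup table), not the models or training procedure used in the paper.

```python
# Toy sketch of a "language bottleneck" classifier: the vision stage emits
# only text, and the classifier sees only that text. All components here
# are illustrative stand-ins, not the paper's actual models.

def caption_image(image_id):
    # Stand-in for a vision-to-text model; a real system would generate
    # this caption from pixels.
    toy_captions = {
        "img_dog": "a small brown dog running on grass",
        "img_cat": "a cat sleeping on a windowsill",
    }
    return toy_captions[image_id]

def classify_text(caption, class_keywords):
    # Text-only classifier: scores each class by keyword overlap with the
    # caption, so the decision depends solely on the text bottleneck.
    words = set(caption.split())
    scores = {c: len(words & kw) for c, kw in class_keywords.items()}
    return max(scores, key=scores.get)

CLASS_KEYWORDS = {
    "dog": {"dog", "puppy", "running"},
    "cat": {"cat", "kitten", "sleeping"},
}

def predict(image_id):
    caption = caption_image(image_id)  # the human-readable explanation
    return classify_text(caption, CLASS_KEYWORDS)

print(predict("img_dog"))  # dog
print(predict("img_cat"))  # cat
```

Because the caption is the sole interface between the two stages, inspecting it directly explains the prediction, which is the explainability property the abstract describes.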
Jun-28-2023
- Country:
- North America > United States (0.47)
- Genre:
- Research Report (0.83)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.50)
- Natural Language (1.00)
- Vision (1.00)