Image Captioning Using Hugging Face Vision Encoder Decoder -- Step-by-Step Guide (Part 2)
In the previous article, we briefly discussed encoder-decoder architectures and our approach to the captioning task. We fine-tuned a language model, which allowed the decoder to learn new words, generate brief captions, and save training time; this can be referred to as priming the decoder before the actual training on the captioning task. Before we get our hands dirty with the code, let us understand how the Vision Encoder Decoder module connects the two models (the image encoder and the text sequence generator) and how it deciphers what is present in an image. To follow along, you need a basic understanding of how transformer attention works and of the terms KEY, QUERY and VALUE.
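To make the KEY/QUERY/VALUE connection concrete, here is a minimal sketch (in NumPy, with randomly initialised projection matrices purely for illustration) of the cross-attention step that links the two models: the QUERY comes from the text decoder's hidden states, while the KEY and VALUE come from the image encoder's patch embeddings. The shapes (5 caption tokens, 197 ViT-style patches, hidden size 768) are illustrative assumptions, not values from the article.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, d_k=64, seed=0):
    """Single-head cross-attention: text tokens attend over image patches."""
    rng = np.random.default_rng(seed)
    d_model = decoder_states.shape[-1]
    # Projection matrices (random here; learned during training in practice)
    W_q = rng.standard_normal((d_model, d_k))
    W_k = rng.standard_normal((d_model, d_k))
    W_v = rng.standard_normal((d_model, d_k))
    Q = decoder_states @ W_q   # QUERY: from the text decoder
    K = encoder_states @ W_k   # KEY:   from the image encoder
    V = encoder_states @ W_v   # VALUE: from the image encoder
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row says how strongly one caption token attends to each image patch
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# 5 caption tokens attending over 197 image patch embeddings (ViT-style)
text_states = np.random.default_rng(1).standard_normal((5, 768))
patch_states = np.random.default_rng(2).standard_normal((197, 768))
out, weights = cross_attention(text_states, patch_states)
print(out.shape)      # (5, 64): one attended vector per caption token
print(weights.shape)  # (5, 197): one distribution over patches per token
```

This is how the decoder "looks at" the image while generating each word: the attention weights form a probability distribution over image patches for every caption token, and the output mixes the patch VALUEs accordingly.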
Jul-18-2022, 18:20:31 GMT