Image Captioners Are Scalable Vision Learners Too

Neural Information Processing Systems 

Transformer decoder has to predict all tokens at once, conditioned only on the image.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found