Image Captioners Are Scalable Vision Learners Too

Oct-9-2025, 01:37:36 GMT–Neural Information Processing Systems

Transformer decoder has to predict all tokens at once, conditioned only on the image.

cappa, large language model, machine learning, (22 more...)

Neural Information Processing Systems

Oct-9-2025, 01:37:36 GMT

Conferences PDF

Country:
- North America > United States > Illinois > Cook County > Chicago (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Leisure & Entertainment (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (0.94)
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)

Duplicate Docs Excel Report

Title
92369a01fbe8046a093746389b2c413e-Paper-Conference.pdf

Similar Docs Excel Report more

Title	Similarity	Source
None found