Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

Apr-24-2026, 04:17:19 GMT–Neural Information Processing Systems

We present a novel methodology for optimizing the use of frozen large language models (LLMs) in resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, focusing on determining the visual features most relevant to the corresponding text. Our approach diverges by concentrating on the language component, specifically on identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts and is trained exclusively on linguistic data, bypassing the need for image-text pairings.

large language model, machine learning, natural language

Industry:
- Education > Curriculum > Subject-Specific Education (0.40)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.92)
  - Vision > Image Understanding (0.75)
  - Machine Learning
    - Neural Networks > Deep Learning (0.46)
    - Performance Analysis > Accuracy (0.41)
