Matryoshka Query Transformer for Large Vision-Language Models

Neural Information Processing Systems 

Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum M. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings; at inference, only the first m latent query tokens are used, so a single set of weights serves any token budget.
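
A minimal PyTorch sketch may help make the mechanism concrete. The class and parameter names below (MatryoshkaQueryTransformer, d_model, num_queries) are illustrative assumptions rather than the paper's released implementation; the sketch shows M learnable latent queries cross-attending to a vision encoder's patch embeddings, with only the first m queries kept at inference.

```python
# Illustrative sketch of the Matryoshka Query Transformer idea; all names
# and hyperparameters here are assumptions, not the paper's actual code.
from typing import Optional

import torch
import torch.nn as nn


class MatryoshkaQueryTransformer(nn.Module):
    """Compress visual embeddings into at most M latent query tokens.

    At inference, only the first m <= M queries are used, so the same
    weights serve any visual-token budget (Matryoshka-style nesting).
    """

    def __init__(self, d_model: int = 1024, num_queries: int = 256, num_heads: int = 8):
        super().__init__()
        self.M = num_queries
        # M learnable latent query tokens (hypothetical initialization scale).
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        # Cross-attention: latent queries attend to the vision encoder's outputs.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual_embeds: torch.Tensor, m: Optional[int] = None) -> torch.Tensor:
        """visual_embeds: (batch, num_patches, d_model); returns (batch, m, d_model)."""
        m = self.M if m is None else m
        assert 1 <= m <= self.M, "m must lie in [1, M]"
        # Keep only the first m latent queries; varying m during training lets
        # every prefix of the query sequence learn to be a useful summary.
        q = self.queries[:m].unsqueeze(0).expand(visual_embeds.size(0), -1, -1)
        out, _ = self.cross_attn(q, visual_embeds, visual_embeds)
        return self.norm(out)


# Example: compress one image's 576 patch embeddings down to 64 visual tokens.
if __name__ == "__main__":
    mqt = MatryoshkaQueryTransformer()
    patches = torch.randn(1, 576, 1024)
    tokens = mqt(patches, m=64)
    print(tokens.shape)  # torch.Size([1, 64, 1024])
```

The sketch exposes m as a forward argument so a Matryoshka-style training loop could randomly vary the number of retained queries per step, which is what makes every prefix of the query sequence usable on its own at inference.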