Matryoshka Query Transformer for Large Vision-Language Models
Neural Information Processing Systems
Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings.
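To make the nesting idea concrete, here is a minimal NumPy sketch of Matryoshka-style query compression: a single cross-attention step that compresses N visual patch embeddings into m tokens by using only the first m of M learned latent queries. The function name `mqt_compress`, the single-layer attention, and all shapes are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mqt_compress(visual_embeds, queries, m):
    """Compress N visual embeddings (N, d) into m tokens (m, d) with one
    cross-attention step, keeping only the first m of the M latent queries
    (illustrative sketch of the Matryoshka nesting, not the real model)."""
    q = queries[:m]                                             # (m, d)
    attn = softmax(q @ visual_embeds.T / np.sqrt(q.shape[1]))   # (m, N)
    return attn @ visual_embeds                                  # (m, d)

rng = np.random.default_rng(0)
M, N, d = 256, 576, 64          # hypothetical sizes; 576 matches the abstract
queries = rng.normal(size=(M, d))
patches = rng.normal(size=(N, d))

tokens_16 = mqt_compress(patches, queries, 16)
tokens_64 = mqt_compress(patches, queries, 64)
print(tokens_16.shape, tokens_64.shape)  # (16, 64) (64, 64)
```

Because each query attends independently, the first 16 outputs of the 64-token compression coincide with the 16-token compression — the tail-truncation property that lets one trained model serve any token budget m ≤ M.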
May-27-2025, 02:13:55 GMT
- Technology:
- Information Technology > Artificial Intelligence
- Natural Language (0.86)
- Vision (0.63)