Matryoshka Query Transformer for Large Vision-Language Models

Neural Information Processing Systems

Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model.