PrefixKV: Adaptive Prefix KVCache is What Vision Instruction-Following Models Need for Efficient Generation
–Neural Information Processing Systems
Large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities across diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders efficient deployment in practical scenarios. The large key-value (KV) cache required by lengthy input and output sequences is a notable contributor to the high inference cost. Motivated by this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for every layer during next-token prediction.
Jun-19-2026, 10:11:39 GMT
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Transportation (0.46)
- Information Technology (0.46)
- Technology: