Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
