DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

Neural Information Processing Systems 

These gains are particularly pronounced on high-resolution tasks, e.g ., 4.2, 11.0, and 4.0 improvements on TextVQA, DocVQA, and InfoVQA compared to LLaV A-1.5-7B, respectively.