Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

Alex Jinpeng Wang

Neural Information Processing Systems 

Training models on longer in-context sequences is a significant challenge for multi-modal machine learning because of the substantial GPU memory and computational costs involved.