Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Brandon Huang 1*, Chancharik Mitra 1*, Leonid Karlinsky 3

Neural Information Processing Systems 

The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length, which is fixed at pretraining. The problem is especially prominent in the multimodal domain, where processing both text and images requires additional tokens.
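To make the context-length bottleneck concrete, here is a rough back-of-the-envelope sketch of how few image-text demonstrations fit in a fixed context window. The context window, per-image token count, and per-shot text budget below are illustrative assumptions (loosely in line with LLaVA-style models), not figures from the paper.

```python
# Back-of-the-envelope: how many (image, text) ICL shots fit in a fixed context?
# All numbers below are illustrative assumptions, not values from the paper.

CONTEXT_WINDOW = 4096    # assumed total context length in tokens
TOKENS_PER_IMAGE = 576   # assumed visual tokens per image (e.g., a 24x24 patch grid)
TOKENS_PER_TEXT = 32     # assumed tokens for each shot's text (question + answer)

def max_shots(context: int = CONTEXT_WINDOW) -> int:
    """Largest number of (image, text) demonstrations that fit alongside one query."""
    per_shot = TOKENS_PER_IMAGE + TOKENS_PER_TEXT
    query_budget = per_shot  # reserve room for the query's own image and text
    return (context - query_budget) // per_shot

print(max_shots())  # 5 shots under these assumptions -- far short of a many-shot regime
```

Under these assumed numbers, only about five multimodal demonstrations fit before the context is exhausted, which illustrates why many-shot multimodal ICL cannot be achieved by simply appending more in-context examples.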