CLiMB: AContinualLearningBenchmark forVision-and-Language Tasks

Neural Information Processing Systems 

This assumption means learning separate models for language-only, vision-only, and vision-language tasks, as opposed to a single "generalist" model that can handle all modalities or subsets of them [Reed et al., 2022]. Yet, existing work suggests that knowledge grounded in multiple modalities can benefit unimodal tasks [Desai and Johnson, 2021, Jin et al., 2022].