Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text
–Neural Information Processing Systems
This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also supports more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images and text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4 (mmc4), an augmentation of the popular text-only c4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features [24], a process that we show outperforms alternatives.
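The placement step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each sentence receives at most one image, uses made-up 2-D vectors in place of real CLIP features, and solves the assignment by brute force, which is only practical for tiny inputs. The function names `cosine` and `assign_images_to_sentences` are hypothetical.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_images_to_sentences(image_vecs, sentence_vecs):
    """Return {image_index: sentence_index} maximizing total similarity.

    Each image is placed at a distinct sentence (the one-to-one
    constraint of a linear assignment problem). Brute force over all
    candidate placements, so suitable for tiny inputs only.
    """
    if len(image_vecs) > len(sentence_vecs):
        raise ValueError("need at least as many sentences as images")
    sim = [[cosine(i, s) for s in sentence_vecs] for i in image_vecs]
    best_score, best = float("-inf"), ()
    for perm in permutations(range(len(sentence_vecs)), len(image_vecs)):
        score = sum(sim[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best = score, perm
    return dict(enumerate(best))


# Toy vectors standing in for CLIP features (invented for illustration).
images = [[1.0, 0.0], [0.9, 0.1]]
sentences = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.3]]
placement = assign_images_to_sentences(images, sentences)
print(placement)
```

In this toy case both images are individually most similar to sentence 0, so a per-image greedy choice would place them at the same sentence; the joint assignment instead keeps image 0 at sentence 0 and moves image 1 to sentence 2. For realistic document sizes, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` (with `maximize=True`) would likely replace the brute-force loop.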
Apr-25-2026, 14:22:33 GMT
- Country:
- Europe (0.46)
- North America > United States (0.28)
- Industry:
- Health & Medicine (0.67)
- Information Technology > Services (0.46)
- Technology:
- Information Technology
- Sensing and Signal Processing > Image Processing (1.00)
- Communications (0.94)
- Artificial Intelligence
- Vision (1.00)
- Natural Language (1.00)
- Machine Learning (1.00)