Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text

Neural Information Processing Systems 

This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also, more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images+text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4 (mmc4), an augmentation of the popular text-only c4 corpus2 with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features [24], a process that we show outperforms alternatives.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found