1c6bed78d3813886d3d72595dbecb80b-Supplemental-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems

Table 4 contains the full set of topics for the k = 30 LDA model introduced in Section 4. Table 4: LDA [6] topic modeling outputs (k = 30 topics) when trained on a random sample of documents from mmc4. Topic frequencies are determined by taking the mean distribution over documents in the corpus. Topic names are generated by GPT-4 conditioned on the top 20 words for each topic, prompted by a request for a short 1-2 word summary. Table 5 and Table 6 list the top-50 most frequent top-level domains for documents and images, as discussed in Section 4. We show domain statistics in both mmc4 and mmc4-core. The symbol "*" is employed to denote specific patterns, such as digits or location acronyms, commonly utilized to differentiate sub-sites within the same domain.
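The topic-frequency computation described above (the mean of the per-document topic distributions) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `topic_frequencies` and the toy `doc_topic` values are assumptions made for the example, and a real run would use the document-topic matrix produced by a fitted LDA model.

```python
def topic_frequencies(doc_topic):
    """Average per-document topic distributions into corpus-level frequencies.

    doc_topic: list of rows, one per document; each row holds that
    document's probability for each of the k topics (hypothetical input).
    """
    if not doc_topic:
        raise ValueError("need at least one document")
    n_docs = len(doc_topic)
    k = len(doc_topic[0])
    # Mean over documents, computed independently for each topic.
    return [sum(row[t] for row in doc_topic) / n_docs for t in range(k)]


# Toy example with two documents and three topics (made-up values).
doc_topic = [
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
]
freqs = topic_frequencies(doc_topic)
```

Because each row sums to 1, the averaged frequencies also sum to 1, so they can be read directly as percentages like those reported per topic.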


Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text

Neural Information Processing Systems

This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images and text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4 (mmc4), an augmentation of the popular text-only c4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features [24], a process that we show outperforms alternatives.
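The linear assignment step described above can be illustrated with a small sketch. It assumes a precomputed image-to-sentence similarity matrix (the values below are made up, standing in for CLIP similarities) and that each image is assigned to a distinct sentence. The solver here is a brute-force search over permutations, chosen only to keep the example dependency-free. It is practical for tiny inputs only, a real pipeline would use a polynomial-time assignment solver, and nothing here is taken from the authors' implementation.

```python
from itertools import permutations


def assign_images_to_sentences(sim):
    """Return (assignment, total) maximizing summed similarity.

    sim[i][j] is the similarity between image i and sentence j
    (hypothetical values). assignment[i] is the sentence index chosen
    for image i; no sentence receives more than one image.
    """
    n_images = len(sim)
    n_sentences = len(sim[0]) if sim else 0
    if n_images > n_sentences:
        raise ValueError("this sketch assumes at most one image per sentence")
    best_total, best_assignment = float("-inf"), None
    # Enumerate every way to give each image its own sentence.
    for perm in permutations(range(n_sentences), n_images):
        total = sum(sim[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_assignment = total, list(perm)
    return best_assignment, best_total


# Two images, three sentences (made-up similarity scores).
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
]
assignment, total = assign_images_to_sentences(sim)
```

In this toy case, greedily giving image 0 its best sentence (index 0) would leave image 1 with a poor match. The joint assignment instead places image 0 at sentence 1 and image 1 at sentence 0, which has the higher total similarity.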


A Dataset Card

Neural Information Processing Systems

Table 4 contains the full set of topics for the k = 30 LDA model introduced in Section 4.

Personal (7.96%): ive, didnt, thing, bit, thought, week, wanted, started, pretty, id
Art (2.70%): art, design, de, images, ikea, image, painting, collection, piano, photo

C Most Frequent Top-Level Domains

Figure 8: Manually labeled images with watermarks and images related to logos or ads.

Table columns: Sentence, Image, CLIP Similarity. Example sentence: "Our new service for teams to manage their fleets for racing."