A Data Collection and Details about the Tokenizers
Neural Information Processing Systems
We collected about 30 million text-image pairs from multiple channels and built a new 2.5TB dataset (after tokenization, its size shrinks to about 250GB). The data sources fall into the following categories: (1) Professional image websites (both English and Chinese). The images on these websites usually come with captions.

The tokenizers were already introduced in section 2.2; here are some further details.

(Figure caption: Colored grids are all the tokens attended to by the token marked "O".)
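The roughly 10x shrinkage from 2.5TB of raw data to about 250GB of tokens can be sanity-checked with a back-of-envelope estimate. The sketch below assumes, purely for illustration (the paper does not state these numbers here), that each image is quantized to a 64x64 grid of discrete codes and each caption averages 64 text tokens, with token ids stored as 2-byte integers:

```python
# Back-of-envelope estimate of the tokenized dataset size.
# All per-pair figures below are illustrative assumptions,
# not values taken from the paper.
NUM_PAIRS = 30_000_000        # ~30 million text-image pairs
IMAGE_TOKENS = 64 * 64        # hypothetical VQ code grid per image
TEXT_TOKENS = 64              # hypothetical average caption length
BYTES_PER_TOKEN = 2           # token ids stored as uint16

total_bytes = NUM_PAIRS * (IMAGE_TOKENS + TEXT_TOKENS) * BYTES_PER_TOKEN
print(f"{total_bytes / 1e9:.1f} GB")  # → 249.6 GB
```

Under these assumed sequence lengths the estimate lands on the same order as the ~250GB reported, which is why storing token ids rather than raw pixels and text yields such a large reduction.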
- Technology: