WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
Conghui He, Zhenjiang Jin, Chao Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, Jiaqi Wang, Dahua Lin
arXiv.org Artificial Intelligence
The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further progress within the community. In response, this paper presents "WanJuan", a large-scale multimodal dataset composed of both Chinese and English data collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was used in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale.
Sep-15-2023