A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training

Perełkiewicz, Michał, Poświata, Rafał

arXiv.org Artificial Intelligence 

The advent of large language models (LLMs) has heralded a new era in natural language processing (NLP), offering capabilities that range from sophisticated text generation to nuanced language understanding. These advancements have been propelled by significant improvements in model architectures, algorithms, and, crucially, the availability of extensive datasets for training. Given the data-intensive nature of these models, the quest for high-quality, diverse, and substantial datasets has become paramount. In this context, massive web-mined corpora have emerged as a vital resource, offering an abundance of textual data that mirrors the vastness and variety of human language and interaction [22, 35, 37, 42]. The internet, with its exponential growth and dynamic content, presents a near-infinite source of text data, spanning every conceivable topic, language, and style. This richness makes web-mined data an attractive foundation for training LLMs, aiming to equip them with a broad understanding of language and its applications. However, the use of such data is not without its challenges. The process of web mining (extracting data from websites) entails navigating a complex landscape of technical, legal, ethical, and quality-related issues [12, 13, 15, 43, 46]. By critically examining the use of web-mined corpora in the pre-training of LLMs, this article contributes to a nuanced understanding of the current landscape and future directions in large-scale language model development.
