A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training

Jul-10-2024, arXiv.org Artificial Intelligence

The advent of large language models (LLMs) has opened a new era in natural language processing (NLP), with capabilities that range from sophisticated text generation to nuanced language understanding. These advances have been driven by improvements in model architectures and algorithms and, crucially, by the availability of extensive training datasets. Because these models are data-intensive, the search for high-quality, diverse, and substantial datasets has become paramount. In this context, massive web-mined corpora have emerged as a vital resource, offering an abundance of textual data that reflects the breadth and variety of human language and interaction [22, 35, 37, 42]. The internet, with its rapid growth and dynamic content, is a vast source of text spanning a wide range of topics, languages, and styles. This richness makes web-mined data an attractive foundation for training LLMs, with the aim of giving them a broad understanding of language and its applications. However, the use of such data is not without challenges. Web mining, the process of extracting data from websites, entails navigating a complex landscape of technical, legal, ethical, and quality-related issues [12, 13, 15, 43, 46]. By critically examining the use of web-mined corpora in LLM pre-training, this article contributes to a nuanced understanding of the current landscape and future directions in large-scale language model development.

