Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

Marinas, Inés Altemir; Kucherenko, Anastasiia; Kucharavy, Andrei

Sep-1-2025–arXiv.org Artificial Intelligence

Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches complete in milliseconds, and all finish in under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
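As a rough illustration of what indexing and querying a text corpus with ElasticSearch can look like, the sketch below builds the two request bodies involved: an NDJSON payload for the `_bulk` endpoint and a phrase query restricted to one language. It is not the report's pipeline. The index name `fineweb2-demo`, the field names `text` and `language`, the sample documents, and the language codes are all hypothetical, and the `term` filter assumes `language` is mapped as a keyword field.

```python
import json


def bulk_lines(index_name, docs):
    """Serialize documents into the NDJSON body expected by the _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index this document under its own id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        # Source line: the fields to store and search (hypothetical schema).
        lines.append(json.dumps({"text": doc["text"], "language": doc["language"]}))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def search_body(phrase, language, size=10):
    """Build a query: phrase match on `text`, filtered to a single language."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match_phrase": {"text": phrase}}],
                # Assumes `language` is a keyword field, so `term` matches exactly.
                "filter": [{"term": {"language": language}}],
            }
        },
    }


# Hypothetical sample documents, not taken from the corpus.
docs = [
    {"id": "doc-1", "text": "an example web page", "language": "fr"},
    {"id": "doc-2", "text": "another example page", "language": "de"},
]
payload = bulk_lines("fineweb2-demo", docs)
query = search_body("example page", "de")
```

The payload would be sent with a POST to `/_bulk` and the query with a POST to `/fineweb2-demo/_search`. Putting the language restriction in `filter` rather than `must` keeps it out of relevance scoring and lets ElasticSearch cache it.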

data mining, large language model, machine learning

Country:
- Asia (0.28)
- Europe > Switzerland (0.14)

Genre:
- Research Report (0.64)

Industry:
- Information Technology (0.93)
- Health & Medicine > Therapeutic Area
  - Infections and Infectious Diseases (0.93)
  - Immunology (0.71)

Technology:
- Information Technology
  - Information Management > Search (1.00)
  - Data Science > Data Mining (1.00)
  - Artificial Intelligence
    - Natural Language
      - Text Processing (1.00)
      - Large Language Model (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.46)
