FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration

Son, Youngjun, Kim, Chaewon, Lee, Jaejin

Jan-1-2025–arXiv.org Artificial Intelligence

Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of LLMs. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes FED, a GPU-accelerated deduplication framework that optimizes MinHash LSH for GPU clusters and leverages computationally efficient and partially reusable non-cryptographic hash functions. FED outperforms the CPU-based deduplication tool included in SlimPajama by up to 58.3 times and the GPU-based deduplication tool included in NVIDIA NeMo Curator by up to 8.6 times when processing 1 million documents on a single node with four GPUs. Deduplication of 1.2 trillion tokens is completed in just 5.1 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (https://github.com/mcrl/FED).
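
For readers unfamiliar with MinHash LSH, the following is a minimal CPU-only sketch of the general technique the abstract refers to. It is not FED's implementation: the function names, the character 5-gram shingling, the 128-hash / 32-band configuration, the 0.8 threshold, and the use of `zlib.crc32` with a random linear hash family are all illustrative assumptions rather than details taken from the paper.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 128          # assumed signature length
BANDS = 32                # assumed number of LSH bands (4 rows per band)
PRIME = (1 << 61) - 1     # Mersenne prime used as the hash modulus

def shingles(text, n=5):
    """Return the set of character n-grams of a whitespace-normalized document."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def make_hash_params(num_hashes=NUM_HASHES, seed=0):
    """Seeded (a, b) pairs defining h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum hash value over all shingles."""
    base = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in base) for a, b in params]

def candidate_pairs(signatures, bands=BANDS):
    """Documents sharing an identical band of the signature become candidates."""
    if not signatures:
        return set()
    rows = len(signatures[0]) // bands
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return pairs

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def find_duplicates(docs, threshold=0.8):
    """Return sorted (i, j) index pairs whose estimated similarity meets the threshold."""
    params = make_hash_params()
    sigs = [minhash_signature(shingles(d), params) for d in docs]
    return sorted(
        (i, j) for i, j in candidate_pairs(sigs)
        if estimated_jaccard(sigs[i], sigs[j]) >= threshold
    )
```

Calling `find_duplicates` on a list of strings returns index pairs of likely near-duplicates. The banding step is what keeps the comparison from being quadratic: only documents that collide in at least one band are compared at all.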

dataset, deduplication, hash function

Country:
- Europe > Ireland
  - Leinster > County Dublin > Dublin (0.04)
- Asia
  - Middle East > Jordan (0.04)
  - South Korea > Seoul
    - Seoul (0.04)

Genre:
- Research Report > New Finding (0.68)

Industry:
- Information Technology (1.00)

Technology:
- Information Technology
  - Hardware (1.00)
  - Graphics (1.00)
  - Data Science (1.00)
  - Artificial Intelligence
    - Machine Learning (1.00)
    - Natural Language > Large Language Model (0.67)
