FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration
Son, Youngjun, Kim, Chaewon, Lee, Jaejin
arXiv.org Artificial Intelligence
Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of LLMs. A commonly used method for data deduplication is the MinHash LSH algorithm. NVIDIA recently introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal and leaves room for improvement in processing efficiency. This paper proposes FED, a GPU-accelerated deduplication framework that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions. FED outperforms the CPU-based deduplication tool included in SlimPajama by up to 58.3 times and the GPU-based deduplication tool included in NVIDIA NeMo Curator by up to 8.6 times when processing 1 million documents on a single node with four GPUs. FED deduplicates 1.2 trillion tokens in just 5.1 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (https://github.com/mcrl/FED).
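For readers unfamiliar with MinHash LSH, the following is a minimal CPU-only sketch of the general technique, not FED's GPU implementation. Everything in it is an assumption chosen for illustration: the character 5-gram shingling, the 64-bit FNV-1a base hash (used here only as one example of a non-cryptographic hash), the signature length of 128, the split into 32 bands, and all function names.

```python
import random
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 128              # signature length (illustrative assumption)
BANDS = 32                    # number of LSH bands (illustrative assumption)
ROWS = NUM_HASHES // BANDS    # signature values per band
PRIME = (1 << 61) - 1         # Mersenne prime used as the modulus

# Fixed seed so the (a, b) pairs of the hash family are reproducible.
_rng = random.Random(0)
PARAMS = [(_rng.randrange(1, PRIME), _rng.randrange(0, PRIME))
          for _ in range(NUM_HASHES)]


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a, a simple non-cryptographic hash."""
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


def shingles(text: str, k: int = 5) -> set:
    """Set of character k-grams after lowercasing and whitespace folding."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(text: str) -> list:
    """MinHash signature: the minimum of each hash function over all shingles."""
    base = [fnv1a_64(s.encode("utf-8")) for s in shingles(text)]
    return [min((a * h + b) % PRIME for h in base) for a, b in PARAMS]


def candidate_pairs(docs: dict) -> set:
    """Document pairs whose signatures agree on at least one full band."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity of two shingle sets, used to verify candidates."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)
```

Calling `candidate_pairs({"a": text_a, "b": text_b})` returns the set of ID pairs that share a bucket. In a typical pipeline, each candidate pair is then checked with `jaccard` against a similarity threshold before one document of the pair is dropped.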
Jan-1-2025
- Country:
  - Asia
    - Middle East > Jordan (0.04)
    - South Korea > Seoul
      - Seoul (0.04)
  - Europe > Ireland
    - Leinster > County Dublin > Dublin (0.04)
- Genre:
  - Research Report > New Finding (0.68)
- Industry:
  - Information Technology (1.00)
- Technology:
  - Information Technology
    - Artificial Intelligence
      - Machine Learning (1.00)
      - Natural Language > Large Language Model (0.67)
    - Data Science (1.00)
    - Graphics (1.00)
    - Hardware (1.00)