Token-wise Influential Training Data Retrieval for Large Language Models
Huawei Lin, Jikai Long, Zhaozhuo Xu, Weijie Zhao
arXiv.org Artificial Intelligence
Given a Large Language Model (LLM) generation, how can we identify which training data led to this generation? In this paper, we propose RapidIn, a scalable framework that adapts to LLMs for estimating the influence of each training data point. The framework consists of two stages: caching and retrieval. First, we compress the gradient vectors by over 200,000x, allowing them to be cached on disk or in GPU/CPU memory. Then, given a generation, RapidIn efficiently traverses the cached gradients to estimate the influence within minutes, achieving over a 6,326x speedup. Moreover, RapidIn supports multi-GPU parallelization to substantially accelerate caching and retrieval. Our empirical results confirm the efficiency and effectiveness of RapidIn.
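The caching-and-retrieval idea can be illustrated with a minimal sketch: compress each training example's gradient with a random-sign (count-sketch-style) projection, cache the sketches, and estimate influence as the inner product between a cached sketch and the sketch of the generation's gradient. This is only an assumption-laden toy (the function names, the sketch size `k`, and the synthetic gradients are all illustrative), not the authors' implementation.

```python
import numpy as np

def compress(grad: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Compress a flattened gradient to k dimensions with a random-sign
    bucketed projection. The same seed must be reused for every gradient
    so that inner products are approximately preserved."""
    r = np.random.default_rng(seed)
    signs = r.choice([-1.0, 1.0], size=grad.shape[0])
    buckets = r.integers(0, k, size=grad.shape[0])
    sketch = np.zeros(k)
    np.add.at(sketch, buckets, signs * grad)
    return sketch

def influence_scores(cached: np.ndarray, test_sketch: np.ndarray) -> np.ndarray:
    """Estimate the influence of each cached training gradient on the
    generation as the inner product of compressed gradients."""
    return cached @ test_sketch

# Toy data: three training gradients of dimension 10,000, sketched to 512 dims.
rng = np.random.default_rng(0)
d, k = 10_000, 512
train_grads = rng.standard_normal((3, d))
# A test gradient that is close to training example #1.
test_grad = train_grads[1] + 0.1 * rng.standard_normal(d)

cached = np.stack([compress(g, k) for g in train_grads])  # caching stage
scores = influence_scores(cached, compress(test_grad, k))  # retrieval stage
print(int(np.argmax(scores)))  # most influential training example (index 1)
```

In the real system the sketches would be written to disk or GPU/CPU memory once, then scanned for every new generation, which is what makes retrieval take minutes rather than hours.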
May-19-2024