Token-wise Influential Training Data Retrieval for Large Language Models
Lin, Huawei; Long, Jikai; Xu, Zhaozhuo; Zhao, Weijie
arXiv.org Artificial Intelligence
Given a Large Language Model (LLM) generation, how can we identify which training data led to this generation? In this paper, we propose RapidIn, a scalable framework adapted to LLMs for estimating the influence of each training example. The proposed framework consists of two stages: caching and retrieval. First, we compress the gradient vectors by over 200,000x, allowing them to be cached on disk or in GPU/CPU memory. Then, given a generation, RapidIn efficiently traverses the cached gradients to estimate the influence within minutes, achieving over a 6,326x speedup. Moreover, RapidIn supports multi-GPU parallelization to substantially accelerate caching and retrieval. Our empirical results confirm the efficiency and effectiveness of RapidIn.
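The two stages described in the abstract can be illustrated with a toy sketch. The abstract does not say how gradients are compressed or how influence is scored, so everything below is an assumption made for illustration: a Gaussian random projection stands in for the compression step, a dot product between compressed gradients stands in for the influence estimate, and a tiny linear model with squared-error loss stands in for the LLM. The helper names (`make_projection`, `compress`, `build_cache`, `retrieve`) are hypothetical and are not RapidIn's API.

```python
import random


def make_projection(dim, k, seed=0):
    """Hypothetical helper: k x dim Gaussian projection, scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / k ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)] for _ in range(k)]


def compress(grad, proj):
    """Project a dim-length gradient down to k numbers."""
    return [sum(p * g for p, g in zip(row, grad)) for row in proj]


def squared_error_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w (toy stand-in model)."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def build_cache(train, w, proj):
    """Caching stage: store one compressed gradient per training example."""
    return [compress(squared_error_grad(w, x, y), proj) for x, y in train]


def retrieve(cache, query, w, proj):
    """Retrieval stage: score every cached gradient against the query gradient.

    Returns training indices sorted from most to least aligned, plus raw scores.
    """
    x, y = query
    q = compress(squared_error_grad(w, x, y), proj)
    scores = [sum(a * b for a, b in zip(c, q)) for c in cache]
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    return ranked, scores


# Toy usage: 8-dimensional gradients compressed to 4 numbers.
dim, k = 8, 4
w = [0.0] * dim
x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, 1.0]
train = [
    (x, 1.0),    # same gradient as the query
    (x, -1.0),   # exactly opposite gradient
    (x, 0.0),    # zero gradient at w = 0
]
proj = make_projection(dim, k)
cache = build_cache(train, w, proj)
ranked, scores = retrieve(cache, (x, 1.0), w, proj)
print(ranked, scores)
```

In this toy setup the training example whose gradient matches the query ranks first and the one with the opposite gradient ranks last. The 2x compression here is only for readability and says nothing about how the reported 200,000x ratio is reached.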
May-19-2024
- Country:
  - Africa > Madagascar (0.04)
  - Pacific Ocean (0.04)
  - Oceania > Australia
    - Victoria > Melbourne (0.04)
    - New South Wales > Sydney (0.04)
  - North America
    - Dominican Republic (0.04)
    - United States
      - Oregon (0.04)
      - Maryland > Baltimore (0.04)
      - Nevada (0.04)
      - Texas > Dallas County
        - Dallas (0.04)
      - Pennsylvania > Philadelphia County
        - Philadelphia (0.04)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - California
        - San Francisco County > San Francisco (0.14)
        - Los Angeles County > Long Beach (0.04)
    - Canada
      - Ontario > Toronto (0.04)
      - Quebec > Montreal (0.04)
      - British Columbia > Metro Vancouver Regional District
        - Vancouver (0.04)
  - Europe
    - United Kingdom > England (0.04)
    - Austria (0.04)
    - Italy
      - Tuscany > Florence (0.04)
      - Calabria > Catanzaro Province
        - Catanzaro (0.04)
  - Asia
    - Japan (0.04)
    - India (0.04)
    - Singapore (0.04)
    - South Korea > Seoul
      - Seoul (0.04)
    - Myanmar > Tanintharyi Region
      - Dawei (0.04)
    - Middle East > UAE
      - Abu Dhabi Emirate > Abu Dhabi (0.04)
    - China
      - Xinjiang Uygur Autonomous Region (0.04)
      - Beijing > Beijing (0.04)
    - Afghanistan > Parwan Province
      - Charikar (0.04)
- Genre:
  - Research Report > New Finding (0.66)
- Industry:
  - Health & Medicine
    - Pharmaceuticals & Biotechnology (1.00)
    - Consumer Health (1.00)
    - Epidemiology (0.94)
    - Therapeutic Area
      - Vaccines (1.00)
      - Infections and Infectious Diseases (1.00)
      - Immunology (1.00)
- Technology: