Accelerating Retrieval-Augmented Generation
Quinn, Derrick; Nouri, Mohammad; Patel, Neel; Salihu, John; Salemi, Alireza; Lee, Sukhan; Zamani, Hamed; Alian, Mohammad
arXiv.org Artificial Intelligence
An evolving solution to address hallucination and enhance accuracy in large language models (LLMs) is Retrieval-Augmented Generation (RAG), which involves augmenting LLMs with information retrieved from an external knowledge source, such as the web. This paper profiles several RAG execution pipelines and demystifies the complex interplay between their retrieval and generation phases. We demonstrate that while exact retrieval schemes are expensive, they can reduce inference time compared to approximate retrieval variants because an exact retrieval model can send a smaller but more accurate list of documents to the generative model while maintaining the same end-to-end accuracy. This observation motivates the acceleration of the exact nearest neighbor search for RAG. In this work, we design Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators. IKS offers 13.4-27.9x faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7-26.3x lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used for other applications running on the server to prevent DRAM, which is the most expensive component in today's servers, from being stranded.
Dec-14-2024
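For readers unfamiliar with the term, the exact nearest neighbor search that the abstract refers to can be illustrated with a brute-force scan. The following is a minimal sketch, not the paper's implementation: it assumes inner-product similarity and small in-memory Python lists, and the function name `exact_top_k` is hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def exact_top_k(query: Sequence[float],
                corpus: Sequence[Sequence[float]],
                k: int) -> List[Tuple[int, float]]:
    """Return (index, score) for the k corpus vectors most similar to `query`.

    Similarity is the inner product (an assumption for this sketch).
    Every vector is scored, so the result is exact: no index, no pruning.
    """
    # Score each document vector against the query, one full pass.
    scores = (
        (sum(q * d for q, d in zip(query, doc)), idx)
        for idx, doc in enumerate(corpus)
    )
    # Keep only the k highest-scoring entries, best first.
    top = heapq.nlargest(k, scores)
    return [(idx, score) for score, idx in top]


if __name__ == "__main__":
    corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
    query = [1.0, 0.2]
    for idx, score in exact_top_k(query, corpus, k=2):
        print(idx, round(score, 2))
```

Each query touches every stored vector, so the work grows linearly with the size of the database. An approximate scheme avoids the full scan at the risk of missing some true neighbors.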
- Country:
- North America
- Dominican Republic (0.04)
- United States
- District of Columbia > Washington (0.04)
- Texas > Travis County
- Austin (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Florida > Orange County
- Orlando (0.04)
- Idaho > Ada County
- Boise (0.04)
- California > San Diego County
- La Jolla (0.04)
- New York
- New York County > New York City (0.05)
- Tompkins County > Ithaca (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Kansas > Douglas County
- Lawrence (0.04)
- Massachusetts
- Hampshire County > Amherst (0.14)
- Middlesex County > Cambridge (0.04)
- Canada
- Quebec > Montreal (0.04)
- Ontario > Toronto (0.04)
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.04)
- Europe
- Switzerland (0.04)
- Spain
- Galicia > Madrid (0.04)
- Catalonia > Barcelona Province
- Barcelona (0.04)
- Italy > Tuscany
- Florence (0.04)
- Croatia > Dubrovnik-Neretva County
- Dubrovnik (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Austria > Styria
- Graz (0.04)
- Asia
- Thailand > Bangkok
- Bangkok (0.04)
- Taiwan > Taiwan Province
- Taipei (0.04)
- Middle East
- Jordan (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- China > Yunnan Province
- Kunming (0.04)
- Genre:
- Research Report (0.82)
- Industry:
- Information Technology (1.00)
- Technology: