ThinK: Thinner Key Cache by Query-Driven Pruning

Xu, Yuhui; Jie, Zhanming; Dong, Hanze; Wang, Lei; Lu, Xudong; Zhou, Aojun; Saha, Amrita; Xiong, Caiming; Sahoo, Doyen

arXiv.org Artificial Intelligence 

Large language models (LLMs) (Hadi et al., 2023; Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023a,b; Scao et al., 2022; Reid et al., 2024) have emerged as a dominant paradigm in natural language processing, achieving state-of-the-art performance across various tasks. A key principle, the Scaling Law (Kaplan et al., 2020), suggests that LLMs exhibit emergent abilities as model size increases, enhancing their capacity to understand context and handle long sequences (Xiong et al., 2023). This growth in capacity allows LLMs to generate coherent and contextually accurate responses and enables various downstream applications, such as document summarization (Zhang et al., 2019, 2024a), code generation (Chen et al., 2021b), and conversational AI (Bordes et al., 2016; OpenAI, 2022). Despite this success, generation with LLMs incurs significant cost, which escalates with increasing model size and sequence length. Notably, both the training (Strubell et al., 2020; Hoffmann et al., 2022; Dong et al., 2024a) and inference (Ainslie et al., 2023) stages involve frequent generation by LLMs, further contributing to these costs. Consequently, efficient LLMs have gained popularity in recent years (Hu et al., 2021; Wan et al., 2023). To address these challenges, quantization (Frantar et al., 2022; Lin et al., 2024; Dettmers et al., 2024; Xu et al., 2023) and pruning methods (Frankle and Carbin, 2018; Blalock et al., 2020) are employed to reduce model size. In addition, handling long sequences incurs a further cost that stems from the transformer attention mechanism.
