Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression
arXiv.org Artificial Intelligence
This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos introduces a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured formats, such as $n:m$ sparsity, optimized for hardware acceleration. Experimental evaluations demonstrate that Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning. By providing an efficient and adaptable approach to model compression, Thanos offers a practical solution for deploying large models in resource-constrained environments.
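The $n:m$ sparsity format mentioned in the abstract can be illustrated with a minimal sketch. It is not the Thanos algorithm: it uses plain magnitude pruning as a stand-in for the importance measure, and it assumes the convention that $n$ of every $m$ consecutive weights are set to zero (some hardware documentation counts the non-zero weights instead). The function name `nm_prune_row` and the sample weights are hypothetical.

```python
from typing import List


def nm_prune_row(row: List[float], n: int, m: int) -> List[float]:
    """Zero the n smallest-magnitude weights in each group of m consecutive weights.

    Illustrative magnitude-based pruning only; Thanos selects weights differently.
    """
    if m <= 0 or not 0 <= n <= m:
        raise ValueError("need m > 0 and 0 <= n <= m")
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")

    pruned = list(row)
    for start in range(0, len(row), m):
        # Indices of this group, ordered from smallest to largest magnitude.
        order = sorted(range(start, start + m), key=lambda i: abs(row[i]))
        # Assumed convention: the n least important weights in the group are removed.
        for i in order[:n]:
            pruned[i] = 0.0
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
print(nm_prune_row(weights, n=2, m=4))
# prints [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.8, 0.0]
```

Every group of four ends up with exactly two zeros, which is the regular pattern that makes such formats suitable for hardware acceleration.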
Apr-9-2025
- Country:
- Asia > Middle East
- Saudi Arabia > Mecca Province > Thuwal (0.04)
- Europe > Italy
- Genre:
- Research Report > Promising Solution (0.45)
- Technology: