LLMs on a Budget? Say HOLA

Siddiqui, Zohaib Hasan, Gao, Jiechao, Shabbir, Ebad, Azeez, Mohammad Anas, Ali, Rafiq, Kashyap, Gautam Siddharth, Naseem, Usman

Oct-10-2025–arXiv.org Artificial Intelligence

Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands posing a barrier for real-time applications in sectors like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and retrieval-augmented generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with LoBi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: 17.6% EMA on GSM8K, 10.5% MCA on ARC, and reduced latency and memory on edge devices like Jetson Nano--proving both scalable and production-ready.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

Oct-10-2025

arXiv.org PDF

Add feedback

Country:
- Asia > India > NCT (0.14)

Genre:
- Research Report (0.82)

Industry:
- Education (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found