Enabling On-Device Medical AI Assistants via Input-Driven Saliency Adaptation

Kallakuri, Uttej, Humes, Edward, Jonna, Rithvik, Lin, Xiaomin, Mohsenin, Tinoosh

Aug-8-2025–arXiv.org Artificial Intelligence

Large Language Models (LLMs) have had a significant impact on healthcare but remain prohibitively large for deployment in real-time, resource-constrained environments such as edge devices. In this work, we introduce a novel medical assistant system, optimized through our general-purpose compression framework, which tailors LLMs for deployment in specialized domains. By measuring neuron saliency on domain-specific data, our method aggressively prunes irrelevant neurons, reducing model size while preserving performance. Following pruning, we apply post-training quantization to further reduce the memory footprint, and we evaluate the compressed model on medical benchmarks including MedMCQA, MedQA, and PubMedQA. We also deploy the 50% compressed Gemma and the 67% compressed LLaMA3 models on a Jetson Orin Nano (18.7 W peak) and a Raspberry Pi 5 (6.3 W peak), achieving real-time, energy-efficient inference under hardware constraints.
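The prune-then-quantize pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not state which saliency metric or quantization scheme the authors use, so this example assumes mean absolute activation over a domain-specific calibration batch as the saliency score and symmetric per-tensor int8 as the post-training quantization. All function names (`neuron_saliency`, `select_neurons`, `prune_mlp`, `quantize_int8`) are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def neuron_saliency(activations: Matrix) -> List[float]:
    """Assumed metric: mean absolute activation per hidden neuron.

    activations[i][j] is the activation of hidden neuron j on
    calibration sample i drawn from domain-specific data.
    """
    n_samples = len(activations)
    n_neurons = len(activations[0])
    return [
        sum(abs(row[j]) for row in activations) / n_samples
        for j in range(n_neurons)
    ]


def select_neurons(saliency: List[float], keep_ratio: float) -> List[int]:
    """Indices of the most salient neurons, returned in original order."""
    n_keep = max(1, round(len(saliency) * keep_ratio))
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
    return sorted(ranked[:n_keep])


def prune_mlp(w_in: Matrix, w_out: Matrix, keep: List[int]) -> Tuple[Matrix, Matrix]:
    """Remove pruned neurons from both sides of an MLP block.

    w_in has shape (hidden, d_model): one row per hidden neuron.
    w_out has shape (d_model, hidden): one column per hidden neuron.
    """
    w_in_pruned = [w_in[j] for j in keep]
    w_out_pruned = [[row[j] for j in keep] for row in w_out]
    return w_in_pruned, w_out_pruned


def quantize_int8(w: Matrix) -> Tuple[List[List[int]], float]:
    """Symmetric per-tensor int8 quantization: w is approximately q * scale."""
    max_abs = max(abs(v) for row in w for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in w]
    return q, scale


# Toy calibration batch: 2 samples, 4 hidden neurons.
acts = [[0.1, -2.0, 0.0, 1.0],
        [0.3, 2.0, 0.0, -1.0]]
keep = select_neurons(neuron_saliency(acts), keep_ratio=0.5)

w_in = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # (hidden=4, d_model=2)
w_out = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]      # (d_model=2, hidden=4)
w_in_p, w_out_p = prune_mlp(w_in, w_out, keep)
q_in, scale_in = quantize_int8(w_in_p)
```

With `keep_ratio=0.5`, half of the hidden neurons are dropped before quantization, which mirrors the order of operations in the abstract (pruning first, quantization second). A real deployment would score saliency on actual medical text and would likely use a finer-grained quantization scheme than the per-tensor one assumed here.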

large language model, machine learning, pruning

Country:
- North America > United States > Maryland > Baltimore (0.04)

Genre:
- Research Report (0.64)

Industry:
- Health & Medicine (1.00)
- Information Technology > Hardware (0.51)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.71)
  - Natural Language > Large Language Model (1.00)