
 Belli, Davide


Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking

arXiv.org Artificial Intelligence

While mobile devices provide ever more compute power, improvements in DRAM bandwidth are much slower. This is unfortunate for large language model (LLM) token generation, which is heavily memory-bound. Previous work has proposed to leverage natural dynamic activation sparsity in ReLU-activated LLMs to reduce effective DRAM bandwidth per token. However, more recent LLMs use SwiGLU instead of ReLU, which results in little inherent sparsity. While SwiGLU activations can be pruned based on magnitude, the resulting sparsity patterns are difficult to predict, rendering previous approaches ineffective. To circumvent this issue, our work introduces Dynamic Input Pruning (DIP): a predictor-free dynamic sparsification approach, which preserves accuracy with minimal fine-tuning. DIP can further use lightweight LoRA adapters to regain some of the performance lost during sparsification. Lastly, we describe a novel cache-aware masking strategy, which considers the cache state and activation magnitude to further increase cache hit rate, improving LLM token rate on mobile devices. DIP outperforms other methods in terms of accuracy, memory, and throughput trade-offs across simulated hardware settings. On Phi-3-Medium, DIP achieves a 46% reduction in memory and a 40% increase in throughput with < 0.1 loss in perplexity.
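
To make the idea concrete, here is a minimal sketch of magnitude-based dynamic pruning in a SwiGLU MLP, plus a cache-biased variant of the mask selection. This illustrates the general technique as described in the abstract, not the paper's actual implementation; all names (`dip_swiglu_mlp`, `keep_fraction`, `cache_bonus`, etc.) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dip_swiglu_mlp(x, W_gate, W_up, W_down, keep_fraction=0.25):
    """Illustrative SwiGLU MLP with magnitude-based dynamic input pruning.

    Keeps only the largest-magnitude gated activations per token; the
    corresponding rows of W_down for masked neurons then never need to be
    fetched from DRAM. All names and defaults are assumptions.
    """
    # Standard SwiGLU: h = silu(x W_gate) * (x W_up)
    h = F.silu(x @ W_gate) * (x @ W_up)

    # Predictor-free pruning: rank intermediate activations by magnitude
    # and zero out everything below the per-token top-k threshold.
    k = max(1, int(keep_fraction * h.shape[-1]))
    threshold = h.abs().topk(k, dim=-1).values[..., -1:]
    h = h * (h.abs() >= threshold)

    # Only the surviving rows of W_down contribute; a real kernel would
    # gather just those rows from cache/DRAM instead of a dense matmul.
    return h @ W_down

def cache_aware_mask(magnitudes, in_cache, k, cache_bonus=1.0):
    """Cache-aware neuron selection (sketch): bias the top-k choice
    toward neurons whose weights are already resident in cache,
    trading a little activation magnitude for a higher hit rate.
    `cache_bonus` is an illustrative knob, not a value from the paper.
    """
    score = magnitudes + cache_bonus * in_cache.float()
    idx = score.topk(k, dim=-1).indices
    mask = torch.zeros_like(magnitudes, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask
```

The key property is that the mask is computed directly from the current activations, so no separate sparsity predictor has to be trained or run, and biasing selection toward already-cached rows reduces DRAM traffic at the same sparsity level.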


GNSS Positioning using Cost Function Regulated Multilateration and Graph Neural Networks

arXiv.org Artificial Intelligence

He obtained his Ph.D. in Electrical Engineering from Eindhoven University of Technology in 2016. His research interests include applications of deep learning in positioning, navigation, and RF signal processing systems. Davide Belli received his M.S. degree in Artificial Intelligence from the University of Amsterdam in 2019. He is currently a Senior Machine Learning Researcher at Qualcomm AI Research. His research interests include deep learning for the visual and RF domains, model personalization, and graph representation learning. Bence Major is a Staff Engineer at Qualcomm AI Research, leading a research team in the use of artificial intelligence for RF sensing and positioning. His research work focuses on non-visual sensory data, such as radar, ultrasound, and wireless signals. He received his M.S. degree in Computer Science from the Budapest University of Technology and Economics. Songwon Jee received his M.S. degree in Electrical Engineering from Stanford University in 2016. He is currently a Senior Staff Engineer in the Location Technology Team at Qualcomm Technology Inc. His research interests include the application of deep learning for location technology involving GNSS, sensors, and wireless technologies. Himanshu Shah received his M.S. and Ph.D. degrees in Electrical Engineering from Arizona State University in 2004 and 2009, respectively.


Chest X-ray Inpainting with Deep Generative Models

arXiv.org Machine Learning

Generative adversarial networks have been successfully applied to inpainting in natural images. However, the current state-of-the-art models have not yet been widely adopted in the medical imaging domain. In this paper, we investigate the performance of three recently published deep learning based inpainting models: context encoders, semantic image inpainting, and the contextual attention model, applied to chest X-rays, as the chest exam is the most commonly performed radiological procedure. We train these generative models on 1.2M 128×128 patches from 60K healthy X-rays, and learn to predict the center 64×64 region in each patch. We test the models on both healthy and abnormal radiographs. We evaluate the results by visual inspection and by comparing PSNR scores. The outputs of the models are in most cases highly realistic. We show that the methods have the potential to enhance and detect abnormalities. In addition, we perform a 2AFC observer study and show that an experienced human observer performs poorly in detecting inpainted regions, particularly those generated by the contextual attention model.
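
The patch setup described above is straightforward to reproduce. Below is a minimal sketch of the data preparation (masking the center 64×64 region of a 128×128 patch) and the PSNR metric mentioned in the abstract; the function names and the [0, 1] intensity range are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def mask_center(patch, hole=64):
    """Zero out the central hole x hole region of a square grayscale patch.

    Returns (masked_input, target_center) for training an inpainting model
    to reconstruct the missing center. `patch` is an HxW array, e.g. 128x128,
    assumed normalized to [0, 1].
    """
    h, w = patch.shape
    top, left = (h - hole) // 2, (w - hole) // 2
    target = patch[top:top + hole, left:left + hole].copy()
    masked = patch.copy()
    masked[top:top + hole, left:left + hole] = 0.0
    return masked, target

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between a reconstruction and the truth."""
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Example: masked, center = mask_center(patch); score = psnr(model_out, center)
```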