Rescaling-Aware Training for Efficient Deployment of Deep Learning Models on Full-Integer Hardware
Mueller, Lion, Garcia-Ortiz, Alberto, Najafi, Ardalan, Fuks, Adam, Bamberg, Lennart
Integer AI inference significantly reduces computational complexity in embedded systems. Quantization-aware training (QAT) helps mitigate the accuracy degradation associated with post-training quantization but still overlooks the impact of integer rescaling during inference, a hardware-costly operation in integer-only AI inference. This work shows that the rescaling cost can be reduced dramatically post-training by applying stronger quantization to the rescale multiplicands at no loss in model quality. Furthermore, we introduce Rescale-Aware Training, a fine-tuning method for ultra-low-bit-width rescaling multiplicands. Experiments show that even with 8x-reduced rescaler widths, full accuracy is preserved through minimal incremental retraining. This enables more energy-efficient and cost-efficient AI inference for resource-constrained embedded systems.
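The rescaling the abstract refers to is the integer-only requantization step that maps a wide accumulator back to a narrow activation, conventionally computed as y = (x * M) >> s, where a real-valued scale is approximated by an integer multiplier M and a right shift s. A minimal sketch, assuming this standard fixed-point form (the function names and the 0.0037 example scale are illustrative, not taken from the paper), shows how the multiplier bit-width can be shrunk:

```python
import math

def quantize_multiplier(real_scale: float, mult_bits: int = 32):
    """Approximate real_scale in (0, 1) as M * 2**(-shift), with M an
    integer of at most `mult_bits` bits. `mult_bits` is the rescaler
    width that rescale-aware approaches try to minimize."""
    mant, exp = math.frexp(real_scale)       # real_scale = mant * 2**exp, mant in [0.5, 1)
    M = round(mant * (1 << mult_bits))       # mult_bits-bit fixed-point mantissa
    shift = mult_bits - exp                  # total right shift
    if M == (1 << mult_bits):                # rounding pushed M out of range
        M >>= 1
        shift -= 1
    return M, shift

def rescale(acc: int, M: int, shift: int) -> int:
    """Integer-only rescale of an accumulator, rounding to nearest
    (assumes shift >= 1, which holds for scales well below 1)."""
    return (acc * M + (1 << (shift - 1))) >> shift
```

For a typical accumulator and scale, an 8-bit multiplier can reproduce the 32-bit rescaling result exactly, which is the effect the paper exploits: narrower rescale multiplicands cost less hardware while leaving the integer outputs, and hence the accuracy, essentially unchanged.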
LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment
Although large language models (LLMs) have demonstrated strong intelligence, their high demand for computation and storage hinders practical application. To this end, many model compression techniques have been proposed to increase the efficiency of LLMs. However, current research validates these methods only on limited models, datasets, and metrics, and still lacks a comprehensive evaluation under more general scenarios, so which compression approach to use in a specific case remains an open question. To mitigate this gap, we present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis of LLM compression algorithms.
Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance
Lin, Guanyu, Feng, Tao, Han, Pengrui, Liu, Ge, You, Jiaxuan
As scientific research proliferates, researchers face the daunting task of navigating and reading vast amounts of literature. Existing solutions, such as document QA, fail to provide personalized and up-to-date information efficiently. We present Paper Copilot, a self-evolving, efficient LLM system designed to assist researchers, based on thought retrieval, user profiles, and high-performance optimization. Specifically, Paper Copilot can offer personalized research services while maintaining a real-time updated database. Quantitative evaluation demonstrates that Paper Copilot saves 69.92% of time after efficient deployment. This paper details the design and implementation of Paper Copilot, highlighting its contributions to personalized academic support and its potential to streamline the research process.
EDAC: Efficient Deployment of Audio Classification Models For COVID-19 Detection
Jovanović, Andrej, Mihaly, Mario, Donaldson, Lennon
The global spread of COVID-19 had severe consequences for public health and the world economy. The quick onset of the pandemic highlighted the potential benefits of cheap, deployable pre-screening methods to monitor the prevalence of the disease in a population. Various researchers made use of machine learning methods in an attempt to detect COVID-19. The solutions leverage various input features, such as CT scans or cough audio signals, with state-of-the-art results arising from deep neural network architectures. However, larger models require more compute, a pertinent consideration when deploying to the edge. To address this, we first recreated two models that use cough audio recordings to detect COVID-19. By applying network pruning and quantisation, we were able to compress these two architectures without reducing their predictive performance. Specifically, we achieved a 105.76x and a 19.34x reduction in the compressed model file sizes, with corresponding 1.37x and 1.71x reductions in the inference times of the two models.
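The two compression steps the abstract names, network pruning and quantisation, can be illustrated with a minimal pure-Python sketch. This is an assumed, generic magnitude-pruning plus symmetric int8 scheme, not the authors' exact pipeline:

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights
    (global magnitude pruning over a flat weight list)."""
    k = int(sparsity * len(weights))
    if k == 0:
        return list(weights)
    thresh = sorted(abs(w) for w in weights)[k - 1]
    pruned, zeroed = [], 0
    for w in weights:
        if abs(w) <= thresh and zeroed < k:
            pruned.append(0.0)   # pruned connections compress well (sparse storage)
            zeroed += 1
        else:
            pruned.append(w)
    return pruned

def quantize_int8(weights):
    """Symmetric uniform int8 quantisation; returns (int values, scale).
    Storing int8 instead of float32 alone gives a ~4x size reduction."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale
```

In practice the two steps compose: pruning makes most weights zero (cheap to store sparsely), and quantisation shrinks the remaining ones, which is how reductions far beyond 4x in file size become possible.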
tinyML Talks webcast: 1) Qeexo's Runtime-Free Architecture for Efficient Deployment; 2) Democratization of Artificial Intelligence (AI) to Small Scale Farmers.
"Qeexo's Runtime-Free Architecture for Efficient Deployment of Neural Networks on Embedded Targets"
Rajen Bhatt, Director of Engineering, Machine Learning, Qeexo Co
Neural networks, including convolutional, feed-forward, recurrent, and convolutional-recurrent, are increasingly popular due to their recent successes in AI applications. Developing neural network models for tinyML applications can be very cumbersome due to the constraints of embedded targets with low-power MCUs. Qeexo has developed a runtime-free architecture for efficiently converting TensorFlow- and PyTorch-generated models to target libraries. This approach builds models that are orders of magnitude smaller than TensorFlow Lite Micro without compromising on latency or inference performance.

"Democratization of Artificial Intelligence (AI) to Small Scale Farmers - a framework to deploy AI Models to Tiny IoT Edges that operate in constrained environments"
Chandrasekar Vuppalapati, Senior Vice President - Products & Programs, Hanumayamma Innovations and Technologies Inc.
Big Data surrounds us. Every minute, our smartphones collect huge amounts of data, from geolocations to the next clickable item on an ecommerce site. Data has become one of the most important commodities for individuals and companies. Nevertheless, this data revolution has not touched every economic sector: small farmers, especially in developing countries, have been largely passed over by the data revolution due to infrastructure and compute-constrained environments. Not only is this a huge missed opportunity for big data companies, it is one of the significant obstacles on the path toward sustainable food and a huge inhibitor to closing economic disparities. The purpose of the talk is to present a TinyML framework for deploying artificial intelligence models in constrained compute environments, enabling remote rural areas and small farmers to join the data revolution.
Big Data In Healthcare: Paris Hospitals Predict Admission Rates Using Machine Learning
Hospitals in Paris are trialling Big Data and machine learning systems designed to forecast admission rates, leading to more efficient deployment of resources and better patient outcomes. The result was the first contribution to an open-source framework of code designed to carry out the analysis over a scalable, distributed infrastructure. Machine learning is employed to determine which algorithms, when fed data from the past, provide the best indicator of future trends. The core of the analytics work involves time-series analysis techniques, looking for ways in which patterns in the data can be used to predict admission rates at different times. This code is already being put to use in several other projects involving healthcare and finance.
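A hedged illustration of the kind of time-series baseline the article describes, a seasonal average that predicts each weekday's admissions from the same weekday in past weeks (the function and data below are invented for illustration; the hospitals' actual pipeline is not shown in the article):

```python
def seasonal_forecast(history, season=7, horizon=7):
    """Forecast `horizon` future points by averaging each seasonal
    position (e.g. weekday, with season=7) over the history.
    `history` is a chronological list of daily counts."""
    forecasts = []
    for h in range(horizon):
        pos = (len(history) + h) % season          # weekday of the future point
        same_pos = history[pos::season]            # all past values on that weekday
        forecasts.append(sum(same_pos) / len(same_pos))
    return forecasts
```

Baselines like this are typically the yardstick against which more elaborate admission-rate models are judged; if a learned model cannot beat the weekly seasonal average, the extra complexity is not paying off.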