partition point
QPART: Adaptive Model Quantization and Dynamic Workload Balancing for Accuracy-aware Edge Inference
Li, Xiangchen, Ghafouri, Saeid, Ji, Bo, Vandierendonck, Hans, John, Deepu, Nikolopoulos, Dimitrios S.
As machine learning inference increasingly moves to edge devices, adapting to diverse computational capabilities, hardware, and memory constraints becomes more critical. Instead of relying on a pre-trained model fixed for all future inference queries across diverse edge devices, we argue that planning an inference pattern with a request-specific model tailored to the device's computational capacity, accuracy requirements, and time constraints is more cost-efficient and robust to diverse scenarios. To this end, we propose an accuracy-aware and workload-balanced inference system that integrates joint model quantization and inference partitioning. In this approach, the server dynamically responds to inference queries by sending a quantized model and adaptively sharing the inference workload with the device. The decision takes into account the device's computational power, channel capacity, and accuracy requirements. Furthermore, we introduce a new optimization framework for the inference system, incorporating joint model quantization and partitioning. Our approach optimizes layer-wise quantization bit width and partition points to minimize time consumption and cost while accounting for varying accuracy requirements of tasks through an accuracy degradation metric in our optimization model. To our knowledge, this work represents the first exploration of optimizing layer-wise quantization bit width in an inference serving system by introducing a theoretical measure of accuracy degradation. Simulation results demonstrate a substantial reduction in overall time and power consumption, with computation payloads decreasing by over 80% and accuracy degradation kept below 1%.
Energy Optimization of Multi-task DNN Inference in MEC-assisted XR Devices: A Lyapunov-Guided Reinforcement Learning Approach
Sun, Yanzan, Qiu, Jiacheng, Pan, Guangjin, Xu, Shugong, Zhang, Shunqing, Wang, Xiaoyun, Han, Shuangfeng
Extended reality (XR), blending virtual and real worlds, is a key application of future networks. While AI advancements enhance XR capabilities, they also impose significant computational and energy challenges on lightweight XR devices. In this paper, we developed a distributed queue model for multi-task DNN inference, addressing issues of resource competition and queue coupling. In response to the challenges posed by the high energy consumption and limited resources of XR devices, we designed a dual time-scale joint optimization strategy for model partitioning and resource allocation, formulated as a bi-level optimization problem. This strategy aims to minimize the total energy consumption of XR devices while ensuring queue stability and adhering to computational and communication resource constraints. To tackle this problem, we devised a Lyapunov-guided Proximal Policy Optimization algorithm, named LyaPPO. Numerical results demonstrate that the LyaPPO algorithm outperforms the baselines, achieving energy conservation of 24.79% to 46.14% under varying resource capacities. Specifically, the proposed algorithm reduces the energy consumption of XR devices by 24.29% to 56.62% compared to baseline algorithms.
Multiresolution Gaussian Processes
We propose a multiresolution Gaussian process to capture long-range, non-Markovian dependencies while allowing for abrupt changes and non-stationarity. The multiresolution GP hierarchically couples a collection of smooth GPs, each defined over an element of a random nested partition. Long-range dependencies are captured by the top-level GP while the partition points define the abrupt changes. Due to the inherent conjugacy of the GPs, one can analytically marginalize the GPs and compute the marginal likelihood of the observations given the partition tree. This property allows for efficient inference of the partition itself, for which we employ graph-theoretic techniques. We apply the multiresolution GP to the analysis of magnetoencephalography (MEG) recordings of brain activity.
FedSplitX: Federated Split Learning for Computationally-Constrained Heterogeneous Clients
Shin, Jiyun, Ahn, Jinhyun, Kang, Honggu, Kang, Joonhyuk
Foundation models (FMs) have demonstrated remarkable performance in machine learning but demand extensive training data and computational resources. Federated learning (FL) addresses the challenges posed by FMs, especially related to data privacy and computational burdens. However, FL on FMs faces challenges in situations with heterogeneous clients possessing varying computing capabilities, as clients with limited capabilities may struggle to train the computationally intensive FMs. To address these challenges, we propose FedSplitX, a novel FL framework that tackles system heterogeneity. FedSplitX splits a large model into client-side and server-side components at multiple partition points to accommodate diverse client capabilities. This approach enables clients to collaborate while leveraging the server's computational power, leading to improved model performance compared to baselines that limit model size to meet the requirement of the poorest client. Furthermore, FedSplitX incorporates auxiliary networks at each partition point to reduce communication costs and delays while enhancing model performance. Our experiments demonstrate that FedSplitX effectively utilizes server capabilities to train large models, outperforming baseline approaches.
When Computing Power Network Meets Distributed Machine Learning: An Efficient Federated Split Learning Framework
Yuan, Xinjing, Pu, Lingjun, Jiao, Lei, Wang, Xiaofei, Yang, Meijuan, Xu, Jingdong
In this paper, we advocate CPN-FedSL, a novel and flexible Federated Split Learning (FedSL) framework over Computing Power Network (CPN). We build a dedicated model to capture the basic settings and learning characteristics (e.g., training flow, latency and convergence). Based on this model, we introduce Resource Usage Effectiveness (RUE), a novel performance metric integrating training utility with system cost, and formulate a multivariate scheduling problem that maximizes RUE by comprehensively taking client admission, model partition, server selection, routing and bandwidth allocation into account (i.e., mixed-integer fractional programming). We design Refinery, an efficient approach that first linearizes the fractional objective and non-convex constraints, and then solves the transformed problem via a greedy-based rounding algorithm in multiple iterations. Extensive evaluations corroborate that CPN-FedSL is superior to the standard and state-of-the-art learning frameworks (e.g., FedAvg and SplitFed), and that Refinery is lightweight and significantly outperforms its variants and de facto heuristic methods under a variety of settings.
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic
Niu, Tao, Teng, Yinglei, Han, Zhu, Zou, Panpan
Recently, deep neural networks (DNNs) have become prominent in many fields, such as computer vision (CV) and natural language processing (NLP), owing to their superior feature extraction performance. However, their high-dimensional parameter models and large-scale mathematical computation restrict execution efficiency, especially on Internet of Things (IoT) devices. In contrast to the previous cloud/edge-only pattern, which puts heavy pressure on uplink communication, and the device-only fashion, which imposes an unaffordable computational burden, we highlight collaborative computation between the device and edge for DNN models, which can achieve a good balance between communication load and execution accuracy. Specifically, a systematic on-demand co-inference framework is proposed to exploit the multi-branch structure, in which the pre-trained AlexNet is right-sized through early-exit and partitioned at an intermediate DNN layer. Integer quantization is enforced to further compress the transmitted bits. Accordingly, we establish a new Deep Reinforcement Learning (DRL) optimizer, Soft Actor-Critic for discrete (SAC-d), which generates the exit point, partition point, and compressing bits by soft policy iterations. Based on a latency- and accuracy-aware reward design, the optimizer adapts well to complex environments such as dynamic wireless channels and arbitrary CPU processing, and is capable of supporting 5G URLLC. Real-world experiments on a Raspberry Pi 4 and a PC demonstrate the superior performance of the proposed solution.
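The integer quantization step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes plain uniform (min-max) quantization of an intermediate feature vector to a chosen number of bits; the abstract does not specify the exact scheme, and the function names `quantize` and `dequantize` are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize(values: Sequence[float], bits: int) -> Tuple[List[int], float, float]:
    """Uniformly quantize values to integer codes in [0, 2**bits - 1].

    Returns (codes, scale, offset) so the receiver can reconstruct
    approximate values. This is an assumed min-max scheme, not the
    paper's confirmed method.
    """
    if bits < 1:
        raise ValueError("bits must be at least 1")
    if not values:
        raise ValueError("values must be non-empty")
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # A constant input has zero range; any positive scale works then.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes: Sequence[int], scale: float, offset: float) -> List[float]:
    """Map integer codes back to approximate floating-point values."""
    return [offset + c * scale for c in codes]


if __name__ == "__main__":
    features = [-1.0, 0.2, 0.7, 3.0]
    codes, scale, offset = quantize(features, bits=4)
    print(codes, dequantize(codes, scale, offset))
```

Fewer bits shrink the payload sent from device to edge (roughly `len(values) * bits` bits plus the two floats), at the price of a reconstruction error of up to half a quantization step per value.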
Autodidactic Neurosurgeon: Collaborative Deep Inference for Mobile Edge Intelligence via Online Learning
Zhang, Letian, Chen, Lixing, Xu, Jie
Recent breakthroughs in deep learning (DL) have led to the emergence of many intelligent mobile applications and services, but they also pose unprecedented computing challenges for resource-constrained mobile devices. This paper builds a collaborative deep inference system between a resource-constrained mobile device and a powerful edge server, aiming to combine the power of on-device processing and computation offloading. The basic idea of this system is to partition a deep neural network (DNN) into a front-end part running on the mobile device and a back-end part running on the edge server, with the key challenge being how to locate the optimal partition point to minimize the end-to-end inference delay. Unlike existing efforts on DNN partitioning that rely heavily on a dedicated offline profiling stage to search for the optimal partition point, our system has a built-in online learning module, called Autodidactic Neurosurgeon (ANS), to automatically learn the optimal partition point on-the-fly. Therefore, ANS is able to closely follow the changes of the system environment by generating new knowledge for adaptive decision making. The core of ANS is a novel contextual bandit learning algorithm, called $\mu$LinUCB, which not only has a provable theoretical learning performance guarantee but also is ultra-lightweight for easy real-world implementation. We implement our system on a video stream object detection testbed to validate the design of ANS and evaluate its performance. The experiments show that ANS significantly outperforms state-of-the-art benchmarks in terms of tracking system changes and reducing the end-to-end inference delay.
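The partition-point problem described in the abstract above can be sketched as an exhaustive search over a chain-structured DNN. This is the offline-profiling baseline that the abstract contrasts with, not the $\mu$LinUCB online learner. The function name `best_partition_point`, the per-layer profile inputs, and the simplification that the result download time is ignored are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def best_partition_point(
    device_ms: Sequence[float],
    server_ms: Sequence[float],
    output_bytes: Sequence[float],
    input_bytes: float,
    bandwidth_bytes_per_ms: float,
) -> Tuple[int, float]:
    """Return (p, delay_ms) minimizing end-to-end delay.

    Layers [0, p) run on the device and layers [p, n) on the server.
    p == 0 uploads the raw input; p == n keeps everything on the device.
    """
    n = len(device_ms)
    if len(server_ms) != n or len(output_bytes) != n:
        raise ValueError("per-layer profiles must have equal length")
    if bandwidth_bytes_per_ms <= 0:
        raise ValueError("bandwidth must be positive")
    best_p, best_delay = 0, float("inf")
    for p in range(n + 1):
        front = sum(device_ms[:p])   # device-side compute time
        back = sum(server_ms[p:])    # server-side compute time
        if p == n:
            tx = 0.0                 # nothing is uploaded
        else:
            payload = input_bytes if p == 0 else output_bytes[p - 1]
            tx = payload / bandwidth_bytes_per_ms
        delay = front + tx + back
        if delay < best_delay:
            best_p, best_delay = p, delay
    return best_p, best_delay


if __name__ == "__main__":
    print(best_partition_point([10, 20, 30], [1, 2, 3], [1000, 100, 10], 5000, 10))
```

With a slow link the search favors a later partition point, where the intermediate output is small; with a fast link it shifts toward full offloading. An online learner is needed when these per-layer profiles are unknown or drift over time.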
Bayesian Nonparametric Adaptive Spectral Density Estimation for Financial Time Series
James, Nick, Marchant, Roman, Gerlach, Richard, Cripps, Sally
Discrimination between non-stationarity and long-range dependency is a difficult and long-standing issue in modelling financial time series. This paper uses an adaptive spectral technique which jointly models the non-stationarity and dependency of financial time series in a non-parametric fashion, assuming that the time series consists of a finite but unknown number of locally stationary processes, the locations of which are also unknown. The model allows a non-parametric estimate of the dependency structure by modelling the auto-covariance function in the spectral domain. All our estimates are made within a Bayesian framework where we use a Reversible Jump Markov Chain Monte Carlo algorithm for inference. We study the frequentist properties of our estimates via a simulation study, and present a novel way of generating time series data from a nonparametric spectrum. Results indicate that our techniques perform well across a range of data generating processes. We apply our method to a number of real examples and our results indicate that several financial time series exhibit both long-range dependency and non-stationarity.
Auto-tuning Neural Network Quantization Framework for Collaborative Inference Between the Cloud and Edge
Li, Guangli, Liu, Lei, Wang, Xueying, Dong, Xiao, Zhao, Peng, Feng, Xiaobing
Recently, deep neural networks (DNNs) have been widely applied in mobile intelligent applications. Inference for DNNs is usually performed in the cloud. However, this leads to a large overhead of transmitting data via the wireless network. In this paper, we demonstrate the advantages of cloud-edge collaborative inference with quantization. By analyzing the characteristics of layers in DNNs, an auto-tuning neural network quantization framework for collaborative inference is proposed. We study the effectiveness of mixed-precision collaborative inference of state-of-the-art DNNs using the ImageNet dataset. The experimental results show that our framework can generate reasonable network partitions and reduce the storage on mobile devices with trivial loss of accuracy.
Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy
As the backbone technology of machine learning, deep neural networks (DNNs) have quickly ascended to the spotlight. Running DNNs on resource-constrained mobile devices is, however, by no means trivial, since it incurs high performance and energy overhead. Offloading DNNs to the cloud for execution, meanwhile, suffers from unpredictable performance due to the uncontrolled long wide-area network latency. To address these challenges, in this paper, we propose Edgent, a collaborative and on-demand DNN co-inference framework with device-edge synergy. Edgent pursues two design knobs: (1) DNN partitioning, which adaptively partitions DNN computation between device and edge in order to leverage hybrid computation resources in proximity for real-time DNN inference, and (2) DNN right-sizing, which accelerates DNN inference through early-exit at a proper intermediate DNN layer to further reduce the computation latency. The prototype implementation and extensive evaluations based on Raspberry Pi demonstrate Edgent's effectiveness in enabling on-demand low-latency edge intelligence.
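One way to combine the two knobs named in the abstract above is sketched below, under stated assumptions: each early-exit branch comes with a known accuracy and a precomputed end-to-end latency for every candidate partition point, and the goal is the most accurate branch that still meets a latency deadline. The function name `choose_exit_and_partition` and the input layout are hypothetical and are not taken from the Edgent implementation.

```python
from typing import List, Optional, Sequence, Tuple

# Each branch is (accuracy, latency_by_partition_ms); the list index of a
# latency entry is the candidate partition point for that branch.
Branch = Tuple[float, Sequence[float]]


def choose_exit_and_partition(
    branches: List[Branch], deadline_ms: float
) -> Optional[Tuple[int, int, float, float]]:
    """Pick the most accurate exit whose fastest partition meets the deadline.

    Returns (exit_index, partition_index, accuracy, latency_ms),
    or None when no branch can meet the deadline.
    """
    best = None
    for exit_idx, (accuracy, latencies) in enumerate(branches):
        if not latencies:
            continue
        # Fastest partition point for this exit branch.
        part_idx = min(range(len(latencies)), key=lambda i: latencies[i])
        latency = latencies[part_idx]
        if latency > deadline_ms:
            continue  # even the best partition is too slow
        if best is None or accuracy > best[2]:
            best = (exit_idx, part_idx, accuracy, latency)
    return best


if __name__ == "__main__":
    demo = [(0.60, [30, 20, 25]), (0.75, [60, 45, 50]), (0.80, [120, 90, 100])]
    print(choose_exit_and_partition(demo, deadline_ms=50))
```

A tighter deadline pushes the choice toward an earlier exit with lower accuracy, which is the trade-off that right-sizing exposes.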