Goto

Collaborating Authors

 average latency



PASS-Enhanced MEC: Joint Optimization of Task Offloading and Uplink PASS Beamforming

Hu, Zhaoming, Zhong, Ruikang, Mu, Xidong, Li, Dengao, Liu, Yuanwei

arXiv.org Artificial Intelligence

A pinching-antenna system (PASS)-enhanced mobile edge computing (MEC) architecture is investigated to improve the task offloading efficiency and latency performance in dynamic wireless environments. By leveraging dielectric waveguides and flexibly adjustable pinching antennas, PASS establishes short-distance line-of-sight (LoS) links while effectively mitigating the significant path loss and potential signal blockage, making it a promising solution for high-frequency MEC systems. We formulate a network latency minimization problem to joint optimize uplink PASS beamforming and task offloading. The resulting problem is modeled as a Markov decision process (MDP) and solved via the deep reinforcement learning (DRL) method. To address the instability introduced by the $\max$ operator in the objective function, we propose a load balancing-aware proximal policy optimization (LBPPO) algorithm. LBPPO incorporates both node-level and waveguide-level load balancing information into the policy design, maintaining computational and transmission delay equilibrium, respectively. Simulation results demonstrate that the proposed PASS-enhanced MEC with adaptive uplink PASS beamforming exhibit stronger convergence capability than fixed-PA baselines and conventional MIMO-assisted MEC, especially in scenarios with a large number of UEs or high transmit power.


Accelerating Non-Maximum Suppression: A Graph Theory Perspective King-Siong Si1* Lu Sun

Neural Information Processing Systems

Non-maximum suppression (NMS) is an indispensable post-processing step in object detection. With the continuous optimization of network models, NMS has become the "last mile" to enhance the efficiency of object detection.


Probabilistic Latency Analysis of the Data Distribution Service in ROS 2

Lee, Sanghoon, Park, Hyung-Seok, Chae, Jiyeong, Park, Kyung-Joon

arXiv.org Artificial Intelligence

--Robot Operating System 2 (ROS 2) is now the de-facto standard for robotic communication, pairing UDP transport with the Data Distribution Service (DDS) publish-subscribe middleware. DDS achieves reliability through periodic heartbeats that solicit acknowledgments for missing samples and trigger selective retransmissions. In lossy wireless networks, the tight coupling among heartbeat period, IP fragmentation, and retransmission interval obscures end-to-end latency behavior and leaves practitioners with little guidance on how to tune these parameters. T o address these challenges, we propose a probabilistic latency analysis (PLA) that analytically models the reliable transmission process of ROS 2 DDS communication using a discrete-state approach. By systematically analyzing both middleware-level and transport-level events, PLA computes the steady-state probability distribution of unacknowledged messages and the retransmission latency. Our findings establish a theoretical basis to systematically optimize reliability, latency, and performance in wireless industrial robotics. Communication has become an increasingly critical factor in modern robotics. Conventional fixed-station robots have long relied on wired links-such as Ethernet-based fieldbuses-for control and data exchange, benefiting from their stability and low latency. For mobile robots and multi-robot systems, however, the need to cut the tether and adopt wireless communication is rapidly growing [1].


Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts

Yang, Jin, Wu, Qiong, Feng, Zhiying, Zhou, Zhi, Guo, Deke, Chen, Xu

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable capabilities, leading to a significant increase in user demand for LLM services. However, cloud-based LLM services often suffer from high latency, unstable responsiveness, and privacy concerns. Therefore, multiple LLMs are usually deployed at the network edge to boost real-time responsiveness and protect data privacy, particularly for many emerging smart mobile and IoT applications. Given the varying response quality and latency of LLM services, a critical issue is how to route user requests from mobile and IoT devices to an appropriate LLM service (i.e., edge LLM expert) to ensure acceptable quality-of-service (QoS). Existing routing algorithms fail to simultaneously address the heterogeneity of LLM services, the interference among requests, and the dynamic workloads necessary for maintaining long-term stable QoS. To meet these challenges, in this paper we propose a novel deep reinforcement learning (DRL)-based QoS-aware LLM routing framework for sustained high-quality LLM services. Due to the dynamic nature of the global state, we propose a dynamic state abstraction technique to compactly represent global state features with a heterogeneous graph attention network (HAN). Additionally, we introduce an action impact estimator and a tailored reward function to guide the DRL agent in maximizing QoS and preventing latency violations. Extensive experiments on both Poisson and real-world workloads demonstrate that our proposed algorithm significantly improves average QoS and computing resource efficiency compared to existing baselines.


Online Scheduling for LLM Inference with KV Cache Constraints

Jaillet, Patrick, Jiang, Jiashuo, Podimata, Chara, Zhou, Zijie

arXiv.org Artificial Intelligence

Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A key challenge in LLM inference is the management of the Key-Value (KV) cache, which reduces redundant computations but introduces memory constraints. In this work, we model LLM inference with KV cache constraints theoretically and propose novel batching and scheduling algorithms that minimize inference latency while effectively managing the KV cache's memory. We analyze both semi-online and fully online scheduling models, and our results are threefold. First, we provide a polynomial-time algorithm that achieves exact optimality in terms of average latency in the semi-online prompt arrival model. Second, in the fully online case with a stochastic prompt arrival, we introduce an efficient online scheduling algorithm with constant regret. Third, we prove that no algorithm (deterministic or randomized) can achieve a constant competitive ratio in fully online adversarial settings. Our empirical evaluations on a public LLM inference dataset, using the Llama-70B model on A100 GPUs, show that our approach significantly outperforms benchmark algorithms used currently in practice, achieving lower latency while reducing energy consumption. Overall, our results offer a path toward more sustainable and cost-effective LLM deployment.


Accelerating Non-Maximum Suppression: A Graph Theory Perspective

Si, King-Siong, Sun, Lu, Zhang, Weizhan, Gong, Tieliang, Wang, Jiahao, Liu, Jiang, Sun, Hao

arXiv.org Artificial Intelligence

Non-maximum suppression (NMS) is an indispensable post-processing step in object detection. With the continuous optimization of network models, NMS has become the ``last mile'' to enhance the efficiency of object detection. This paper systematically analyzes NMS from a graph theory perspective for the first time, revealing its intrinsic structure. Consequently, we propose two optimization methods, namely QSI-NMS and BOE-NMS. The former is a fast recursive divide-and-conquer algorithm with negligible mAP loss, and its extended version (eQSI-NMS) achieves optimal complexity of $\mathcal{O}(n\log n)$. The latter, concentrating on the locality of NMS, achieves an optimization at a constant level without an mAP loss penalty. Moreover, to facilitate rapid evaluation of NMS methods for researchers, we introduce NMS-Bench, the first benchmark designed to comprehensively assess various NMS methods. Taking the YOLOv8-N model on MS COCO 2017 as the benchmark setup, our method QSI-NMS provides $6.2\times$ speed of original NMS on the benchmark, with a $0.1\%$ decrease in mAP. The optimal eQSI-NMS, with only a $0.3\%$ mAP decrease, achieves $10.7\times$ speed. Meanwhile, BOE-NMS exhibits $5.1\times$ speed with no compromise in mAP.


Straggler-Resilient Differentially-Private Decentralized Learning

Yakimenka, Yauhen, Weng, Chung-Wei, Lin, Hsuan-Yin, Rosnes, Eirik, Kliewer, Jörg

arXiv.org Artificial Intelligence

We consider the straggler problem in decentralized learning over a logical ring while preserving user data privacy. Especially, we extend the recently proposed framework of differential privacy (DP) amplification by decentralization by Cyffers and Bellet to include overall training latency--comprising both computation and communication latency. Analytical results on both the convergence speed and the DP level are derived for both a skipping scheme (which ignores the stragglers after a timeout) and a baseline scheme that waits for each node to finish before the training continues. A trade-off between overall training latency, accuracy, and privacy, parameterized by the timeout of the skipping scheme, is identified and empirically validated for logistic regression on a real-world dataset and for image classification using the MNIST and CIFAR-10 datasets.


The Synergy of Speculative Decoding and Batching in Serving Large Language Models

Su, Qidong, Giannoula, Christina, Pekhimenko, Gennady

arXiv.org Artificial Intelligence

Large Language Models (LLMs) like GPT are state-of-the-art text generation models that provide significant assistance in daily routines. However, LLM execution is inherently sequential, since they only produce one token at a time, thus incurring low hardware utilization on modern GPUs. Batching and speculative decoding are two techniques to improve GPU hardware utilization in LLM inference. To study their synergy, we implement a prototype implementation and perform an extensive characterization analysis on various LLM models and GPU architectures. We observe that the optimal speculation length depends on the batch size used. We analyze the key observation and build a quantitative model to explain it. Based on our analysis, we propose a new adaptive speculative decoding strategy that chooses the optimal speculation length for different batch sizes. Our evaluations show that our proposed method can achieve equal or better performance than the state-of-the-art speculation decoding schemes with fixed speculation length.


CloudGripper: An Open Source Cloud Robotics Testbed for Robotic Manipulation Research, Benchmarking and Data Collection at Scale

Zahid, Muhammad, Pokorny, Florian T.

arXiv.org Artificial Intelligence

We present CloudGripper, an open source cloud robotics testbed, consisting of a scalable, space and cost-efficient design constructed as a rack of 32 small robot arm work cells. Each robot work cell is fully enclosed and features individual lighting, a low-cost custom 5 degree of freedom Cartesian robot arm with an attached parallel jaw gripper and a dual camera setup for experimentation. The system design is focused on continuous operation and features a 10 Gbit/s network connectivity allowing for high throughput remote-controlled experimentation and data collection for robotic manipulation. CloudGripper furthermore is intended to form a community testbed to study the challenges of large scale machine learning and cloud and edge-computing in the context of robotic manipulation. In this work, we describe the mechanical design of the system, its initial software stack and evaluate the repeatability of motions executed by the proposed robot arm design. A local network API throughput and latency analysis is also provided. CloudGripper-Rope-100, a dataset of more than a hundred hours of randomized rope pushing interactions and approximately 4 million camera images is collected and serves as a proof of concept demonstrating data collection capabilities. A project website with more information is available at https://cloudgripper.org.