ELK: Exploring the Efficiency of Inter-core Connected AI Chips with Deep Learning Compiler Techniques

Liu, Yiqi, Xue, Yuqi, Crawford, Noelle, Xue, Jilong, Huang, Jian

arXiv.org Artificial Intelligence

To meet the increasing demands of deep learning (DL) models, AI chips employ both off-chip memory (e.g., HBM) and high-bandwidth, low-latency interconnects for direct inter-core data exchange. However, it is not easy to explore the efficiency of these inter-core connected AI (ICCA) chips, due to a fundamental tussle among compute (per-core execution), communication (inter-core data exchange), and I/O (off-chip data access). In this paper, we develop Elk, a DL compiler framework that maximizes the efficiency of ICCA chips by jointly trading off all three of these performance factors. Elk structures these performance factors into configurable parameters and forms a global trade-off space in the DL compiler. To systematically explore this space and maximize overall efficiency, Elk employs a new inductive operator scheduling policy and a cost-aware on-chip memory allocation algorithm. It generates globally optimized execution plans that best overlap off-chip data loading with on-chip execution. To examine the efficiency of Elk, we build a full-fledged emulator based on a real ICCA chip, the IPU-POD4, and an ICCA chip simulator for sensitivity analysis with different interconnect network topologies. Elk achieves 94% of the ideal roofline performance of ICCA chips on average, showing the benefits of supporting large DL models on ICCA chips. We also show Elk's capability of enabling architecture design space exploration for new ICCA chip development.
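The payoff of overlapping off-chip data loading with on-chip execution can be made concrete with a toy pipeline cost model. This is an illustrative sketch, not Elk's actual scheduler or cost function; the function names and timing numbers are hypothetical.

```python
# Toy cost model for overlapping off-chip loads with on-chip compute.
# All names and numbers are illustrative, not Elk's actual parameters.

def pipelined_time(tiles, load_time, compute_time):
    """Total time when each tile's load overlaps the previous tile's compute."""
    if tiles == 0:
        return 0.0
    # The first load cannot be hidden; afterwards the slower stage dominates,
    # and the last tile's compute drains the pipeline.
    return load_time + (tiles - 1) * max(load_time, compute_time) + compute_time

def serial_time(tiles, load_time, compute_time):
    """Total time when loading and compute never overlap."""
    return tiles * (load_time + compute_time)

# With compute-bound tiles, loading is almost fully hidden:
print(pipelined_time(8, load_time=2.0, compute_time=5.0))  # 42.0
print(serial_time(8, load_time=2.0, compute_time=5.0))     # 56.0
```

When `load_time` exceeds `compute_time`, the pipeline becomes I/O-bound and the max term flips, which is exactly the kind of regime change a global trade-off space has to capture.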


Distributed Link Sparsification for Scalable Scheduling Using Graph Neural Networks (Journal Version)

Zhao, Zhongyuan, Verma, Gunjan, Swami, Ananthram, Segarra, Santiago

arXiv.org Artificial Intelligence

In wireless networks characterized by dense connectivity, the significant signaling overhead generated by distributed link scheduling algorithms can exacerbate issues like congestion, energy consumption, and radio footprint expansion. To mitigate these challenges, we propose a distributed link sparsification scheme employing graph neural networks (GNNs) to reduce scheduling overhead for delay-tolerant traffic while maintaining network capacity. A GNN module is trained to adjust contention thresholds for individual links based on traffic statistics and network topology, enabling links to withdraw from scheduling contention when they are unlikely to succeed. Our approach is facilitated by a novel offline constrained unsupervised learning algorithm capable of balancing two competing objectives: minimizing scheduling overhead while ensuring that total utility meets the required level. In simulated wireless multi-hop networks with up to 500 links, our link sparsification technique effectively alleviates network congestion and reduces radio footprints across four distinct distributed link scheduling protocols.

Index Terms: Threshold, massive access, scalable scheduling, graph neural networks, constrained unsupervised learning.

The proliferation of wireless devices and emerging machine-type communications (MTC) [2] has led to new requirements for next-generation wireless networks, including massive access in ultra-dense networks, spectrum and energy efficiencies, multi-hop connectivity, and scalability [3]-[6]. A promising solution to these challenges is self-organizing wireless multi-hop networks, which have been applied to scenarios where infrastructure is infeasible or overloaded, such as military communications, satellite communications, vehicular/drone networks, Internet of Things (IoT), and 5G/6G (device-to-device (D2D), wireless backhaul, integrated access and backhaul (IAB)) [3]-[10].
Received 27 February 2024; revised 20 January 2025, 17 June 2025, and 13 August 2025; accepted 1 September 2025. Research was sponsored by the DEVCOM ARL Army Research Office and was accomplished under Cooperative Agreement Number W911NF-19-2-0269. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. Zhongyuan Zhao and Santiago Segarra are with the Department of Electrical and Computer Engineering, Rice University, USA.
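The core sparsification idea, a link withdrawing from contention when its threshold is not met, can be sketched in a few lines. In the paper the per-link thresholds come from a trained GNN; in this hypothetical sketch they are simply given as inputs.

```python
# Illustrative threshold-based link sparsification (not the paper's GNN):
# each link stays in scheduling contention only if its utility meets its
# per-link threshold. In the paper, thresholds are produced by a GNN from
# traffic statistics and network topology; here they are supplied directly.

def sparsify(link_utilities, thresholds):
    """Return the links that remain in contention for this scheduling round."""
    return [link for link, utility in link_utilities.items()
            if utility >= thresholds.get(link, 0.0)]

link_utilities = {"a-b": 0.9, "b-c": 0.2, "c-d": 0.6}
thresholds = {"a-b": 0.5, "b-c": 0.5, "c-d": 0.5}
print(sparsify(link_utilities, thresholds))  # ['a-b', 'c-d']
```

Links that withdraw generate no contention signaling at all, which is where the overhead savings for delay-tolerant traffic come from.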


Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving

Shi, Xiaoxiang, Cai, Colin, Du, Junjia, Jia, Zhihao

arXiv.org Artificial Intelligence

Monolithic serving with chunked prefill improves GPU utilization by batching prefill and decode together, but suffers from fine-grained phase interference. Engine-level prefill-decode (PD) disaggregation avoids interference but incurs higher hardware and coordination overhead. Prior intra-GPU disaggregation approaches multiplex prefill and decode within a single GPU, using SLO-based tuning guided by heuristics from offline profiling or reactive feedback loops. However, these methods respond reactively to performance issues rather than anticipating them, limiting adaptability under dynamic workloads. We ask: can we achieve proactive intra-GPU disaggregation that adapts effectively to dynamic workloads? The key challenge lies in managing the conflicting resource demands of prefill and decode under varying conditions. We first show that GPU resources exhibit diminishing returns -- beyond a saturation point, more allocation yields minimal latency benefit. Second, we observe that memory bandwidth contention becomes a critical bottleneck. These insights motivate a design that dynamically partitions GPU resources across prefill and decode phases, while jointly considering compute capacity, memory footprint, and bandwidth contention. Evaluated on diverse LLMs and workloads, our system Nexus achieves up to 2.2x higher throughput, 20x lower TTFT, and 2.5x lower TBT than vLLM; outperforms SGLang by up to 2x; and matches or exceeds disaggregated vLLM.
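The diminishing-returns observation suggests a simple partitioning rule: cap each phase's allocation at its saturation point and split any remainder. The sketch below is a hypothetical toy model, not Nexus's actual partitioner; the latency curve and saturation points are made up.

```python
# Toy model of the diminishing-returns insight behind intra-GPU
# partitioning (illustrative only, not Nexus's algorithm).

def latency(units, work, saturation):
    """Latency improves with allocated units only up to a saturation point."""
    return work / min(units, saturation)

def partition(total_units, prefill_sat, decode_sat):
    """Give each phase up to its saturation point; split any leftover evenly."""
    p = min(prefill_sat, total_units)
    d = min(decode_sat, total_units - p)
    leftover = total_units - p - d
    return p + leftover // 2, d + (leftover - leftover // 2)

# Beyond saturation, extra units buy nothing:
assert latency(80, work=100.0, saturation=60) == latency(60, 100.0, 60)

print(partition(100, prefill_sat=60, decode_sat=30))  # (65, 35)
```

A real partitioner must also account for memory footprint and bandwidth contention, which is precisely the joint consideration the abstract argues for.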



Collaborative Min-Max Regret in Grouped Multi-Armed Bandits

Blanchard, Moïse, Goyal, Vineet

arXiv.org Machine Learning

This model has wide-ranging applications including recommender systems that sequentially offer product recommendations to users to maximize long-term revenue [LCLS10], or clinical trials in which the goal is to find the best treatment for a sequence of patients [Tho33]. A key aspect of multi-armed bandits is that to maximize long-term reward, one must balance between exploration (acquiring more information about suboptimal actions to potentially improve future decisions) and exploitation (following the best actions given the current information). Exploration naturally comes at a cost for users, which can disproportionately impact certain groups of users. This raises important questions about when and how the exploration burden can be alleviated for these groups [RSWW18, JKL20]. Notably, [BF24] showed that in the asymptotic regime, the exploration cost can be shared in an arbitrarily unfair manner between groups for standard learning policies and proposed a Nash-bargaining solution to alleviate this issue. As an equivalent perspective on the problem, we can consider a setting in which several agents face their own multi-armed bandit problem, e.g., groups of recommenders targeting different populations. The goal is then to understand when and how collaborative exploration can be beneficial as opposed to each group solving its problem individually without sharing information. We focus on heterogeneity between agents or groups in terms of their set of available actions, which corresponds to the so-called grouped bandit setting [BF24].
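The exploration/exploitation balance described above is often made concrete with the standard UCB1 index policy. This is a textbook sketch of that classical algorithm, not the paper's collaborative grouped-bandit policy; the arm means and horizon are arbitrary.

```python
# Standard UCB1 sketch (classical algorithm, not the paper's method):
# each round plays the arm maximizing empirical mean + exploration bonus.
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Simulate UCB1 on Bernoulli arms; return the pull count per arm."""
    rng = random.Random(seed)
    n = len(arm_means)
    counts = [0] * n
    sums = [0.0] * n
    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1  # initialization: play each arm once
        else:
            # empirical mean plus an optimism bonus that shrinks with pulls
            arm = max(range(n),
                      key=lambda a: sums[a] / counts[a]
                                    + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.3, 0.7], horizon=500)
# the better arm (index 1) ends up pulled far more often
```

The pulls spent on the inferior arm are the exploration cost; in the grouped setting the question becomes which group pays it.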


Disaggregating Embedding Recommendation Systems with FlexEMR

Huang, Yibo, Yang, Zhenning, Xing, Jiarong, Dai, Yi, Qiu, Yiming, Wu, Dingming, Lai, Fan, Chen, Ang

arXiv.org Artificial Intelligence

Efficiently serving embedding-based recommendation (EMR) models remains a significant challenge due to their increasingly large memory requirements. Today's practice splits the model across many monolithic servers, where a mix of GPUs, CPUs, and DRAM is provisioned in fixed proportions. This approach leads to suboptimal resource utilization and increased costs. Disaggregating embedding operations from neural network inference is a promising solution but raises novel networking challenges. In this paper, we discuss the design of FlexEMR for optimized EMR disaggregation. FlexEMR proposes two sets of techniques to tackle the networking challenges: Leveraging the temporal and spatial locality of embedding lookups to reduce data movement over the network, and designing an optimized multi-threaded RDMA engine for concurrent lookup subrequests. We outline the design space for each technique and present initial results from our early prototype.
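One of FlexEMR's two ideas, exploiting the temporal locality of embedding lookups, can be sketched with a small compute-side LRU cache that short-circuits repeated remote fetches. This is an assumption-laden illustration: the cache size, class name, and `fetch` callback are hypothetical, and the real system's RDMA path is not modeled.

```python
# Hypothetical sketch of exploiting temporal locality in embedding lookups:
# a small LRU cache on the compute side avoids re-fetching hot embedding IDs
# over the network. The fetch callback stands in for a remote (e.g. RDMA) read.
from collections import OrderedDict

class EmbeddingCache:
    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch            # remote lookup for a cache miss
        self.cache = OrderedDict()
        self.misses = 0

    def lookup(self, emb_id):
        if emb_id in self.cache:
            self.cache.move_to_end(emb_id)   # refresh LRU position
            return self.cache[emb_id]
        self.misses += 1
        vec = self.fetch(emb_id)
        self.cache[emb_id] = vec
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used
        return vec

cache = EmbeddingCache(2, fetch=lambda i: [float(i)] * 4)
for emb_id in [1, 2, 1, 1, 3, 1]:
    cache.lookup(emb_id)
print(cache.misses)  # 3: IDs 1, 2 and 3 each fetched once over the network
```

Every cache hit is one less lookup subrequest crossing the network, which directly reduces the data movement the paper targets.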


A Collaborative PIM Computing Optimization Framework for Multi-Tenant DNN

Li, Bojing, Zhong, Duo, Chen, Xiang, Liu, Chenchen

arXiv.org Artificial Intelligence

Modern Artificial Intelligence (AI) applications are increasingly utilizing multi-tenant deep neural networks (DNNs), which lead to a significant rise in computing complexity and the need for computing parallelism. ReRAM-based processing-in-memory (PIM) computing, with its high density and low power consumption characteristics, holds promising potential for supporting the deployment of multi-tenant DNNs. However, direct deployment of complex multi-tenant DNNs on existing ReRAM-based PIM designs poses challenges. Resource contention among different tenants can result in severe under-utilization of on-chip computing resources. Moreover, area-intensive operators and computation-intensive operators require excessively large on-chip areas and long processing times, leading to high overall latency during parallel computing. To address these challenges, we propose a novel ReRAM-based in-memory computing framework that enables efficient deployment of multi-tenant DNNs on ReRAM-based PIM designs. Our approach tackles the resource contention problems by iteratively partitioning the PIM hardware at the tenant level. In addition, we construct a fine-grained reconstructed processing pipeline at the operator level to handle area-intensive operators. Compared to direct deployments on traditional ReRAM-based PIM designs, our proposed PIM computing framework achieves significant improvements in speed (ranging from 1.75x to 60.43x) and energy (up to 1.89x).
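Tenant-level partitioning of PIM hardware can be illustrated with a toy proportional-share heuristic. This is not the paper's iterative partitioning algorithm; the tenant names, demands, and array counts are invented for the sketch.

```python
# Toy tenant-level partitioning of PIM crossbar arrays (a proportional-share
# heuristic for illustration, not the paper's iterative algorithm): each
# tenant receives arrays in proportion to its demand, with at least one
# array each so no tenant is starved.

def partition_arrays(total_arrays, demands):
    """Map each tenant to a crossbar-array count proportional to its demand."""
    total_demand = sum(demands.values())
    shares = {tenant: max(1, round(total_arrays * d / total_demand))
              for tenant, d in demands.items()}
    # Rounding may overshoot the chip's capacity; trim from the largest share.
    while sum(shares.values()) > total_arrays:
        biggest = max(shares, key=shares.get)
        shares[biggest] -= 1
    return shares

print(partition_arrays(64, {"resnet": 3.0, "bert": 5.0, "tiny": 0.1}))
```

Even this crude split shows why a minimum-share floor matters: without it, a low-demand tenant rounds to zero arrays and stalls entirely, which is one face of the under-utilization problem the paper describes.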


Group Related Phenomena in Wikipedia Edits

Burgess, M., Dunbar, R. I. M.

arXiv.org Artificial Intelligence

Human communities have self-organizing properties that give rise to very specific natural grouping patterns, reflected in the Dunbar Number and its layered structure (a Dunbar Graph). Since work-groups are necessarily also social groups, we might expect the same principles to apply here as well. One factor likely to be important in limiting the size of groups is that conflicts typically escalate with the number of people involved. Here we analyse Wikipedia editing histories across a wide range of topics to show that there is an emergent coherence in the size of groups formed transiently to edit the content of subject texts, with two peaks averaging at around $N=8$ for the size corresponding to maximal contention, and at around $N=4$ as a regular team. These values are consistent with the observed sizes of conversational groups, as well as the hierarchical structuring of Dunbar graphs. We use the Promise Theory of trust to suggest a scaling law that may apply to all group distributions based on seeded attraction. In addition to providing further evidence that even natural communities of strangers are self-organising, the results have important implications for the governance of the Wikipedia commons and for the security of all online social platforms and associations.