disaggregation
Efficient Long-context Language Model Training by Core Attention Disaggregation
Zhuang, Yonghao, Chen, Junda, Pang, Bo, Gu, Yi, Zhu, Yibo, Jiang, Yimin, Stoica, Ion, Xing, Eric, Zhang, Hao
We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.
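The abstract's two observations (statelessness and composability) reduce balancing to scheduling compute-bound tasks across a server pool. A toy sketch of such a scheduler, using a quadratic cost model and a greedy least-loaded assignment (our illustration with hypothetical names, not DistCA's actual rebatching logic):

```python
def attention_cost(seq_len: int) -> int:
    """Approximate core-attention cost of one token-level shard: O(L^2)."""
    return seq_len * seq_len

def rebatch(shard_lens, num_servers):
    """Greedy longest-processing-time assignment: send each shard to the
    currently least-loaded attention server, so per-server compute is
    near-equal without splitting any shard."""
    loads = [0] * num_servers
    batches = [[] for _ in range(num_servers)]
    for length in sorted(shard_lens, reverse=True):
        i = loads.index(min(loads))  # least-loaded server so far
        batches[i].append(length)
        loads[i] += attention_cost(length)
    return batches, loads
```

Because modern attention kernels stay efficient on fused batches of arbitrary-length shards, each server can run its assigned batch as a single fused kernel call.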
Export Reviews, Discussions, Author Feedback and Meta-Reviews
If you google "fully adapted particle filters" you will find a lot more material. The authors have considered four different and all relevant application examples. The experimental section shows that the iFDM seems to work and that it can provide interesting results. The only comparison provided is against the FFBS-type algorithm, which we know will perform worse by construction. I know that it is a lot of work to implement other solutions to the problem, but doing so would probably provide an even better understanding of the model's performance, and it would be interesting to see how existing solutions to these problems fare. For example, for the multitarget tracking example, the simplest solution would probably be to use an extended Kalman filter together with nearest-neighbour data association. Since your targets are very well separated, I would expect this solution to perform quite well. It would be interesting to compare your performance against this simple standard solution. I have not worked with the cocktail party problem or the multiuser detection problem, but for the power disaggregation problem there are interesting solutions available; see for example the following NIPS paper (which is gaining some influence): Kolter, J. Z.; Batra, S.; and Ng, A. Y. Energy disaggregation via discriminative sparse coding. NIPS 2010.
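The baseline the reviewer suggests pairs a per-target extended Kalman filter with nearest-neighbour data association. The association step alone can be sketched as follows (our illustration; function name and greedy strategy are ours, and a real tracker would gate distances and handle missed detections):

```python
def nearest_neighbour_associate(predictions, measurements):
    """Greedily assign each predicted target position the closest unused
    measurement. Points are (x, y) tuples; returns a list of
    (target_index, measurement_index) pairs."""
    unused = set(range(len(measurements)))
    pairs = []
    for t, (px, py) in enumerate(predictions):
        if not unused:
            break  # more targets than measurements
        j = min(unused, key=lambda k: (measurements[k][0] - px) ** 2
                                      + (measurements[k][1] - py) ** 2)
        pairs.append((t, j))
        unused.discard(j)
    return pairs
```

With well-separated targets, as in the paper's experiments, this greedy assignment rarely differs from an optimal one.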
Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving
Shi, Xiaoxiang, Cai, Colin, Du, Junjia, Jia, Zhihao
Monolithic serving with chunked prefill improves GPU utilization by batching prefill and decode together, but suffers from fine-grained phase interference. Engine-level prefill-decode (PD) disaggregation avoids interference but incurs higher hardware and coordination overhead. Prior intra-GPU disaggregation approaches multiplex prefill and decode within a single GPU, using SLO-based tuning guided by heuristics from offline profiling or reactive feedback loops. However, these methods respond reactively to performance issues rather than anticipating them, limiting adaptability under dynamic workloads. We ask: can we achieve proactive intra-GPU disaggregation that adapts effectively to dynamic workloads? The key challenge lies in managing the conflicting resource demands of prefill and decode under varying conditions. We first show that GPU resources exhibit diminishing returns: beyond a saturation point, more allocation yields minimal latency benefit. Second, we observe that memory bandwidth contention becomes a critical bottleneck. These insights motivate a design that dynamically partitions GPU resources across prefill and decode phases, while jointly considering compute capacity, memory footprint, and bandwidth contention. Evaluated on diverse LLMs and workloads, our system Nexus achieves up to 2.2x higher throughput, 20x lower time to first token (TTFT), and 2.5x lower time between tokens (TBT) than vLLM; outperforms SGLang by up to 2x; and matches or exceeds disaggregated vLLM.
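The "diminishing returns" observation suggests allocating each phase only up to its saturation point. A minimal sketch of finding that point from a measured latency curve (our illustration, not Nexus's actual partitioning policy; the name and the threshold rule are assumptions):

```python
def saturation_point(latency_by_alloc, eps=0.02):
    """latency_by_alloc: measured phase latencies at increasing resource
    allocations (e.g. growing SM shares). Returns the index of the first
    allocation step whose next increment improves latency by less than a
    relative fraction `eps`."""
    for i in range(len(latency_by_alloc) - 1):
        cur, nxt = latency_by_alloc[i], latency_by_alloc[i + 1]
        if (cur - nxt) / cur < eps:
            return i
    return len(latency_by_alloc) - 1
```

Resources freed beyond each phase's saturation point can then be handed to the other phase, subject to the memory-bandwidth contention the abstract highlights.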
Frontier: Simulating the Next Generation of LLM Inference Systems
Feng, Yicheng, Tan, Xin, Sew, Kin Hang, Jiang, Yimin, Zhu, Yibo, Xu, Hong
Large Language Model (LLM) inference is growing increasingly complex with the rise of Mixture-of-Experts (MoE) models and disaggregated architectures that decouple components like prefill/decode (PD) or attention/FFN (AF) for heterogeneous scaling. Existing simulators, architected for co-located, dense models, are unable to capture the intricate system dynamics of these emerging paradigms. We present Frontier, a high-fidelity simulator designed from the ground up for this new landscape. Frontier introduces a unified framework to model both co-located and disaggregated systems, providing native support for MoE inference with expert parallelism (EP). It enables the simulation of complex workflows like cross-cluster expert routing and advanced pipelining strategies for latency hiding. To ensure fidelity and usability, Frontier incorporates refined operator models for improved accuracy. Frontier empowers the community to design and optimize the future of LLM inference at scale.
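At the core of any such simulator sits a per-operator latency model. A minimal roofline-style version, in which an operator is either compute-bound or memory-bandwidth-bound (our illustration of the general idea; Frontier's refined operator models are more detailed than this):

```python
def op_latency(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline estimate: an operator's latency (seconds) is the larger of
    its compute time and its memory-traffic time on the target device."""
    return max(flops / peak_flops, bytes_moved / peak_bw)
```

Summing such estimates along a model's operator graph, with communication terms for EP and disaggregated links, yields the kind of end-to-end prediction a simulator validates against real runs.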
NILMFormer: Non-Intrusive Load Monitoring that Accounts for Non-Stationarity
Petralia, Adrien, Charpentier, Philippe, Kadhi, Youssef, Palpanas, Themis
Millions of smart meters have been deployed worldwide, collecting the total power consumed by individual households. Based on these data, electricity suppliers offer their clients energy monitoring solutions to provide feedback on the consumption of their individual appliances. Historically, such estimates have relied on statistical methods that use coarse-grained total monthly consumption and static customer data, such as appliance ownership. Non-Intrusive Load Monitoring (NILM) is the problem of disaggregating a household's collected total power consumption to retrieve the consumed power for individual appliances. Current state-of-the-art (SotA) solutions for NILM are based on deep learning (DL) and operate on subsequences of an entire household consumption reading. However, the non-stationary nature of real-world smart meter data leads to a drift in the data distribution within each segmented window, which significantly affects model performance. This paper introduces NILMFormer, a Transformer-based architecture that incorporates a new subsequence stationarization/de-stationarization scheme to mitigate the distribution drift and that uses a novel positional encoding that relies only on the subsequence's timestamp information. Experiments with 4 real-world datasets show that NILMFormer significantly outperforms the SotA approaches. Our solution has been deployed as the backbone algorithm for the consumption monitoring service of EDF (Électricité de France), delivering detailed insights to millions of customers about their individual appliances' power consumption. This paper appeared in KDD 2025.
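The stationarization idea can be illustrated with a simple per-window normalize/denormalize wrapper: remove each subsequence's mean and scale before the model sees it, then restore them on the output. This is only a sketch of the setting; NILMFormer's actual scheme learns how to reinject the removed statistics rather than merely restoring them:

```python
from statistics import fmean, pstdev

def stationarize(window, eps=1e-8):
    """Z-normalize one subsequence so the model sees a stationarized input;
    return the removed statistics for later reinjection."""
    mu = fmean(window)
    sigma = pstdev(window) + eps  # eps guards against constant windows
    return [(x - mu) / sigma for x in window], mu, sigma

def destationarize(pred, mu, sigma):
    """Map the model's output back to the original power scale."""
    return [y * sigma + mu for y in pred]
```

Because the drift statistics differ per window, computing them per subsequence (rather than globally) is what counters the distribution drift the abstract describes.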
Beyond the Buzz: A Pragmatic Take on Inference Disaggregation
Mitra, Tiyasa, Borkar, Ritika, Bhatia, Nidhi, Matas, Ramon, Raj, Shivam, Mudigere, Dheevatsa, Zhao, Ritchie, Golub, Maximilian, Dutta, Arpan, Madduri, Sailaja, Jani, Dharmesh, Pharris, Brian, Rouhani, Bita Darvish
As inference scales to multi-node deployments, disaggregation - splitting inference into distinct phases - offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. Our findings offer actionable insights for efficient disaggregated deployments to navigate the trade-off between system throughput and interactivity.
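The rate-matching idea can be made concrete with a static back-of-envelope split: size the prefill and decode pools in proportion to their per-request service times so neither pool starves the other. This is our toy illustration of the concept only; the paper's point is that *dynamic* rate matching and elastic scaling are needed in practice:

```python
def pool_split(total_gpus, prefill_time_s, decode_time_s):
    """Return (prefill_gpus, decode_gpus) so that, at steady state, both
    pools complete requests at roughly the same rate."""
    frac = prefill_time_s / (prefill_time_s + decode_time_s)
    prefill = max(1, round(total_gpus * frac))
    return prefill, total_gpus - prefill
```

Under prefill-heavy traffic, `frac` grows and the split shifts GPUs toward prefill, which matches the abstract's finding that disaggregation pays off most for prefill-heavy patterns and larger models.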
Season-Independent PV Disaggregation Using Multi-Scale Net Load Temporal Feature Extraction and Weather Factor Fusion
Chen, Xiaolu, Huang, Chenghao, Zhang, Yanru, Wang, Hao
With the advancement of the energy Internet and energy system integration, the increasing adoption of distributed photovoltaic (PV) systems presents new challenges for smart monitoring and measurement at utility companies, particularly in separating PV generation from the net electricity load. This paper proposes a PV disaggregation method that integrates Hierarchical Interpolation (HI) and multi-head self-attention mechanisms. By using HI to extract net load features and multi-head self-attention to capture the complex dependencies between weather factors, the method achieves precise PV generation predictions. Simulation experiments on real-world data demonstrate the effectiveness of the proposed method, supporting improved monitoring and management of distributed energy systems.