ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

Sul, Stuart H., Arora, Simran, Spector, Benjamin F., Ré, Christopher

arXiv.org Artificial Intelligence

Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance: data-transfer mechanisms, resource scheduling, and design overheads. We validate PK on both Hopper and Blackwell architectures. With fewer than 50 lines of device code, PK achieves up to $2.33 \times$ speedup for data- and tensor-parallel workloads, $4.08 \times$ for sequence-parallel workloads, and $1.22 \times$ for expert-parallel workloads.


Stochastic Mean-Shift Clustering

Lapidot, Itshak, Sepulcre, Yann, Trigano, Tom

arXiv.org Artificial Intelligence

Numerous algorithms have been proposed and investigated, among which are k-means [1], spectral clustering [2, 3], DBSCAN [4], and the well-known mean-shift (MS) clustering algorithm. MS is an effective non-parametric iterative algorithm [5], which is versatile for clustering, tracking, and smoothing tasks. A well-known and widely used variant of MS is the blurring mean-shift (BMS) [6]. Both MS and BMS can be described as "deterministic" iterative procedures aiming to find local maximizers of an objective function, since they do not involve any random selection of points in their update rule. Both algorithms have been applied to a variety of domains, and several variations on their original formulation have been proposed: see [7] for BMS with a Gaussian kernel (known as Gaussian blurring mean-shift) and [8] for BMS applied to high-dimensional data clustering.
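The deterministic MS update rule described above is compact enough to sketch directly. The following is a minimal one-dimensional illustration with a Gaussian kernel, not the authors' implementation; the function names and bandwidth value are assumptions for the example:

```python
import math

def mean_shift_step(points, x, bandwidth):
    # One MS update: move x to the kernel-weighted average of all points.
    weights = [math.exp(-((p - x) ** 2) / (2 * bandwidth ** 2)) for p in points]
    return sum(w * p for w, p in zip(weights, points)) / sum(weights)

def mean_shift(points, x, bandwidth=1.0, tol=1e-6, max_iter=500):
    # Iterate until the shift is negligible: x converges to a local maximizer
    # of the kernel density estimate (a cluster mode). No randomness is
    # involved, which is why MS is called "deterministic". BMS differs in
    # that every point is moved at each iteration, blurring the data set.
    for _ in range(max_iter):
        x_new = mean_shift_step(points, x, bandwidth)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

points = [0.0, 0.5, 1.0, 9.5, 10.0, 10.5]  # two well-separated clusters
mode_a = mean_shift(points, 0.5)   # converges to the mode near 0.5
mode_b = mean_shift(points, 10.0)  # converges to the mode near 10.0
```

Seeds started in different basins converge to different modes, which is how MS assigns cluster membership.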


TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

Gond, Raja, Kwatra, Nipun, Ramjee, Ramachandran

arXiv.org Artificial Intelligence

Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLink. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Furthermore, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The communication of one subset is then overlapped with the computation of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce-RMSNorm kernel that carefully leverages multimem instruction support available on Hopper and Blackwell NVIDIA GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other subset's computation, providing additional gains. Our evaluations demonstrate up to 1.29x speedup in latency and 1.26x higher throughput across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
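The "wave-aware" part of Token-Splitting can be illustrated with a toy calculation. The sketch below is a hypothetical illustration, not TokenWeave's actual code: it assumes a `wave_size` equal to the number of tokens one full wave of SMs processes, and rounds the first subset to a whole number of waves so that neither half ends in a mostly idle tail wave.

```python
def wave_aware_split(num_tokens, wave_size):
    # Split a batch of tokens into two approximately equal subsets, with the
    # first subset rounded to a whole number of GPU "waves". A subset whose
    # size is not a wave multiple ends with a partially filled (mostly idle)
    # tail wave, wasting compute that the overlap is meant to hide.
    half = num_tokens // 2
    first = max(wave_size, round(half / wave_size) * wave_size)
    first = min(first, num_tokens)
    return first, num_tokens - first

first, second = wave_aware_split(1000, 128)  # e.g. (512, 488)
```

With the split in hand, the communication of one subset can be launched while the other subset's computation runs, as the abstract describes.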


Toward Interpretable Evaluation Measures for Time Series Segmentation

Chavelli, Félix, Boniol, Paul, Thomazo, Michaël

arXiv.org Artificial Intelligence

Time series segmentation is a fundamental task in analyzing temporal data across various domains, from human activity recognition to energy monitoring. While numerous state-of-the-art methods have been developed to tackle this problem, the evaluation of their performance remains critically limited. Existing measures predominantly focus on change point accuracy or rely on point-based measures such as the Adjusted Rand Index (ARI), which fail to capture the quality of the detected segments, ignore the nature of errors, and offer limited interpretability. In this paper, we address these shortcomings by introducing two novel evaluation measures: WARI (Weighted Adjusted Rand Index), which accounts for the position of segmentation errors, and SMS (State Matching Score), a fine-grained measure that identifies and scores four fundamental types of segmentation errors while allowing error-specific weighting. We empirically validate WARI and SMS on synthetic and real-world benchmarks, showing that they not only provide a more accurate assessment of segmentation quality but also uncover insights, such as error provenance and type, that are inaccessible with traditional measures.
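The position-blindness of ARI that motivates this work is easy to demonstrate: ARI depends only on the label contingency table, so two segmentations that misplace a change point in opposite directions can score identically. A minimal sketch using the standard ARI definition (not the paper's WARI or SMS):

```python
from math import comb
from collections import Counter

def ari(labels_a, labels_b):
    # Standard Adjusted Rand Index computed from the contingency table.
    # Note that point positions never enter the formula, only co-label counts.
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    row_counts = Counter(labels_a)
    col_counts = Counter(labels_b)
    index = sum(comb(c, 2) for c in pair_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (index - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]          # one change point, at position 3
early = [0, 0, 1, 1, 1, 1]          # change point detected one step early
late  = [0, 0, 0, 0, 1, 1]          # change point detected one step late
# ari(truth, early) == ari(truth, late): ARI cannot tell these errors apart.
```

Both segmentations make a single one-point error, but in different directions; ARI assigns them the same score, which is exactly the interpretability gap WARI and SMS target.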


SMS: Self-supervised Model Seeding for Verification of Machine Unlearning

Wang, Weiqi, Zhang, Chenhan, Tian, Zhiyi, Yu, Shui

arXiv.org Artificial Intelligence

Many machine unlearning methods have been proposed recently to uphold users' right to be forgotten. However, offering users verification of their data removal post-unlearning is an important yet under-explored problem. Current verifications typically rely on backdooring, i.e., adding backdoored samples to influence model performance. Nevertheless, the backdoor methods can merely establish a connection between backdoored samples and models but fail to connect the backdoor with genuine samples. Thus, the backdoor removal can only confirm the unlearning of backdoored samples, not users' genuine samples, as genuine samples are independent of backdoored ones. In this paper, we propose a Self-supervised Model Seeding (SMS) scheme to provide unlearning verification for genuine samples. Unlike backdooring, SMS links user-specific seeds (such as users' unique indices), original samples, and models, thereby facilitating the verification of unlearning genuine samples. However, implementing SMS for unlearning verification presents two significant challenges. First, embedding the seeds into the service model while keeping them secret from the server requires a sophisticated approach. We address this by employing a self-supervised model seeding task, which embeds the entire sample, including the seed, into the model's latent space. Second, maintaining the utility of the original service model while ensuring the seeding effect requires a delicate balance. The effectiveness of the proposed SMS scheme is evaluated through extensive experiments on three representative datasets, utilizing various model architectures and exact and approximate unlearning benchmarks. The results demonstrate that SMS provides effective verification for genuine sample unlearning, effectively addressing the limitations of existing solutions.
In recent years, numerous privacy regulations and laws, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) [1], have been introduced to safeguard individuals' data privacy. These regulations guarantee individuals the right to be forgotten, prompting an active research topic, machine unlearning [2, 3, 4]. Machine unlearning aims to remove the trace of user-specified samples from already-trained models, ensuring compliance with these privacy mandates.


Modeling and Control Framework for Autonomous Space Manipulator Handover Operations

Quevedo, Diego, Hudson, Sarah, Kim, Donghoon

arXiv.org Artificial Intelligence

Autonomous space robotics is poised to play a vital role in future space missions, particularly for In-space Servicing, Assembly, and Manufacturing (ISAM). A key capability in such missions is the Robot-to-Robot (R2R) handover of mission-critical objects. This work presents a dynamic model of a dual-arm space manipulator system and compares various tracking control laws. The key contributions of this work are the development of a cooperative manipulator dynamic model and the comparative analysis of control laws to support autonomous R2R handovers in ISAM scenarios.

INTRODUCTION

The global space industry has grown significantly over the past decade and is expected to continue expanding. In-space Servicing, Assembly, and Manufacturing (ISAM) is emerging as a transformative approach to space operations.


Human sensory-musculoskeletal modeling and control of whole-body movements

Zuo, Chenhui, Lin, Guohao, Zhang, Chen, Zhuang, Shanning, Sui, Yanan

arXiv.org Artificial Intelligence

Coordinated human movement depends on the integration of multisensory inputs, sensorimotor transformation, and motor execution, as well as sensory feedback resulting from body-environment interaction. Building dynamic models of the sensory-musculoskeletal system is essential for understanding movement control and investigating human behaviours. Here, we report a human sensory-musculoskeletal model, termed SMS-Human, that integrates precise anatomical representations of bones, joints, and muscle-tendon units with multimodal sensory inputs involving visual, vestibular, proprioceptive, and tactile components. A stage-wise hierarchical deep reinforcement learning framework was developed to address the inherent challenges of high-dimensional control in musculoskeletal systems with integrated multisensory information. Using this framework, we demonstrated the simulation of three representative movement tasks, including bipedal locomotion, vision-guided object manipulation, and human-machine interaction during bicycling. Our results showed a close resemblance between natural and simulated human motor behaviours. The simulation also revealed musculoskeletal dynamics that could not be directly measured. This work sheds deeper insights into the sensorimotor dynamics of human movements, facilitates quantitative understanding of human behaviours in interactive contexts, and informs the design of systems with embodied intelligence.


Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks

Liu, Yang, Yan, Bingjie, Zou, Tianyuan, Zhang, Jianqing, Gu, Zixuan, Ding, Jianbing, Wang, Xidong, Li, Jingyi, Ye, Xiaozhou, Ouyang, Ye, Yang, Qiang, Zhang, Ya-Qin

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated remarkable capabilities, but they require vast amounts of data and computational resources. In contrast, smaller models (SMs), while less powerful, can be more efficient and tailored to specific domains. In this position paper, we argue that taking a collaborative approach, where large and small models work synergistically, can accelerate the adaptation of LLMs to private domains and unlock new potential in AI. We explore various strategies for model collaboration and identify potential challenges and opportunities. Building upon this, we advocate for industry-driven research that prioritizes multi-objective benchmarks on real-world private datasets and applications.


Balanced Image Stylization with Style Matching Score

Jiang, Yuxin, Jiang, Liming, Yang, Shuai, Liu, Jia-Wei, Tsang, Ivor, Shou, Mike Zheng

arXiv.org Artificial Intelligence

We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments.


Electrical Load Forecasting over Multihop Smart Metering Networks with Federated Learning

Rahman, Ratun, Moriano, Pablo, Khan, Samee U., Nguyen, Dinh C.

arXiv.org Artificial Intelligence

Electric load forecasting is essential for power management and stability in smart grids. This is mainly achieved via advanced metering infrastructure, where smart meters (SMs) record household energy data. Traditional machine learning (ML) methods are often employed for load forecasting but require data sharing, which raises data privacy concerns. Federated learning (FL) can address this issue by running distributed ML models at local SMs without data exchange. However, current FL-based approaches struggle to achieve efficient load forecasting due to imbalanced data distribution across heterogeneous SMs. This paper presents a novel personalized federated learning (PFL) method for high-quality load forecasting in metering networks. A meta-learning-based strategy is developed to address data heterogeneity at local SMs in the collaborative training of local load forecasting models. Moreover, to minimize the load forecasting delays in our PFL model, we study a new latency optimization problem based on optimal resource allocation at SMs. A theoretical convergence analysis is also conducted to provide insights into FL design for federated load forecasting. Extensive simulations on real-world datasets show that our method outperforms existing approaches in terms of better load forecasting and reduced operational latency costs.

Electrical load forecasting is crucial for power management in smart grids. This service is mainly supported via advanced metering infrastructure, where smart meters (SMs) record household energy consumption and share this data with the utility company's server [2]. This enables utility providers to estimate future electricity demands and thereby bolster grid reliability. Conventional load-forecasting techniques based on machine learning (ML) and deep learning (DL) exploit pattern-finding abilities to predict future outcomes.
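The collaborative training step in such FL schemes is typically a data-size-weighted average of client model parameters (FedAvg-style); the paper's meta-learning-based personalization builds on top of such a step. A minimal sketch of the plain aggregation, with hypothetical names, assuming each SM's model is a flat parameter vector:

```python
def fedavg(client_weights, client_sizes):
    # Server-side FL aggregation: average each parameter across clients,
    # weighted by local dataset size, without ever seeing the raw meter data.
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two SMs with local models [1.0, 2.0] and [3.0, 4.0]; the second SM
# holds three times as much data, so it dominates the average.
global_model = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])  # -> [2.5, 3.5]
```

Heterogeneous SMs are exactly where this plain average struggles, which is the gap the personalized, meta-learning-based strategy in the paper targets.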