Planning & Scheduling
Kant: An Efficient Unified Scheduling System for Large-Scale AI Clusters
Zeng, Lingling, Zhang, Gen, Peng, Jialin, Xu, Xiang, Xu, Yuan, Ma, Lijun
As AI cluster sizes continue to expand and the demand for large-language-model (LLM) training and inference workloads grows rapidly, traditional scheduling systems face significant challenges in balancing resource utilization, scheduling efficiency, and service quality. This paper presents and evaluates Kant: an efficient unified scheduling platform designed for large-scale AI container clusters, supporting the co-scheduling of both training and inference jobs. Based on the practical implementation of the Kant system, we systematically define a set of key evaluation metrics for AI clusters, including GPU Allocation Ratio (GAR), Scheduling Occupancy Rate (SOR), GPU Node Fragmentation Ratio (GFR), Job Waiting Time Distribution (JWTD), and Job Training Time Estimation Distribution (JTTED), providing a foundation for quantitative performance analysis. Experimental results demonstrate that Kant achieves exceptional performance in clusters ranging from hundreds to tens of thousands of GPUs. By leveraging scheduling strategies such as Backfill and Enhanced Binpack (E-Binpack), the system significantly improves resource utilization and scheduling efficiency, while effectively reducing resource fragmentation and communication overhead in distributed training. The system has been deployed in multiple AI data center clusters, where it stably supports large-scale intelligent computing workloads. This work provides a practical engineering approach for building high-performance, highly available, AI-native scheduling infrastructure.
Attractor Network Dynamics Enable Preplay and Rapid Path Planning in Mazeโlike Environments
Dane S. Corneil, Wulfram Gerstner
Rodents navigating in a well-known environment can rapidly learn and revisit observed reward locations, often after a single trial. While the mechanism for rapid path planning is unknown, the CA3 region in the hippocampus plays an important role, and emerging evidence suggests that place cell activity during hippocam-pal "preplay" periods may trace out future goal-directed trajectories. Here, we show how a particular mapping of space allows for the immediate generation of trajectories between arbitrary start and goal locations in an environment, based only on the mapped representation of the goal. We show that this representation can be implemented in a neural attractor network model, resulting in bump-like activity profiles resembling those of the CA3 region of hippocampus. Neurons tend to locally excite neurons with similar place field centers, while inhibiting other neurons with distant place field centers, such that stable bumps of activity can form at arbitrary locations in the environment. The network is initialized to represent a point in the environment, then weakly stimulated with an input corresponding to an arbitrary goal location. We show that the resulting activity can be interpreted as a gradient ascent on the value function induced by a reward at the goal location. Indeed, in networks with large place fields, we show that the network properties cause the bump to move smoothly from its initial location to the goal, around obstacles or walls. Our results illustrate that an attractor network with hippocampal-like attributes may be important for rapid path planning.
Supplementary Materials for " nullnullnullnullnullnullnullnullnull: Monte-Carlo Planning in Continuous Space MDPs with Non-Asymptotic Analysis " A Algorithm Details
Output: action to take a. Parameters: maximum depth H allowed in HOO. L covers all the nodes in T . We start with the concentration property and then the convergence results. To proceed further, we first need to state several definitions that are useful throughout. We reproduce these definitions here for completeness.