SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference

Zhang, Zongshun, Matta, Ibrahim

arXiv.org Artificial Intelligence

Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while minimizing the costs of adaptive inference applications. However, prior work often overlooks real-world factors such as Virtual Machine (VM) cold starts and requests under long-tail service time distributions. To tackle these limitations, we model each ML query (request) as traversing an acyclic sequence of stages, wherein each stage constitutes a contiguous block of sparse model parameters ending in an internal or final classifier where requests may exit. Since input-dependent exit rates vary, no single resource configuration suits all query distributions. IaaS-based VMs become underutilized when many requests exit early, yet rapidly scaling to handle request bursts reaching deep layers is impractical. SERFLOW addresses this challenge by leveraging FaaS-based serverless functions (containers) and using stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage. By integrating this provisioning with adaptive load balancing across VMs and serverless functions based on request ingestion, SERFLOW reduces cloud costs by over 23% while efficiently adapting to dynamic workloads.
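
To make the stage-specific provisioning idea concrete, here is a minimal sketch (not SERFLOW's actual algorithm): given the fraction of requests that exit at each stage, it estimates the load reaching each stage and picks the cheaper of a reserved VM versus per-invocation serverless functions. All exit fractions, capacities, and prices below are hypothetical.

```python
import math

# Hypothetical exit fractions per stage; must sum to 1.0.
STAGE_EXIT_FRACTIONS = [0.5, 0.3, 0.2]

def stage_loads(arrival_rate, exit_fractions):
    """Expected request rate (req/s) reaching each stage."""
    loads, remaining = [], 1.0
    for frac in exit_fractions:
        loads.append(arrival_rate * remaining)
        remaining -= frac
    return loads

def cheapest_backend(load, vm_capacity=100.0, vm_cost_per_hour=0.05,
                     fn_cost_per_req=0.0008):
    """Compare a reserved VM (billed per hour regardless of utilization)
    with serverless functions (billed per request). Prices are made up."""
    vm_total = math.ceil(load / vm_capacity) * vm_cost_per_hour  # $/hour
    fn_total = load * 3600 * fn_cost_per_req                     # $/hour
    return ("vm", vm_total) if vm_total <= fn_total else ("function", fn_total)

for i, load in enumerate(stage_loads(50.0, STAGE_EXIT_FRACTIONS)):
    backend, cost = cheapest_backend(load)
    print(f"stage {i}: {load:5.1f} req/s -> {backend} (${cost:.2f}/h)")
```

Later stages see less traffic, so they tend to flip from reserved VMs toward pay-per-request functions, which is the intuition behind per-stage provisioning.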


Multi-Modal View Enhanced Large Vision Models for Long-Term Time Series Forecasting

Shen, ChengAo, Yu, Wenchao, Zhao, Ziming, Song, Dongjin, Cheng, Wei, Chen, Haifeng, Ni, Jingchao

arXiv.org Artificial Intelligence

Time series, typically represented as numerical sequences, can also be transformed into images and texts, offering multi-modal views (MMVs) of the same underlying signal. These MMVs can reveal complementary patterns and enable the use of powerful pre-trained large models, such as large vision models (LVMs), for long-term time series forecasting (LTSF). However, as we identify in this work, the state-of-the-art (SOTA) LVM-based forecaster exhibits an inductive bias toward "forecasting periods". To harness this bias, we propose DMMV, a novel decomposition-based multi-modal view framework that leverages trend-seasonal decomposition and a novel backcast-residual based adaptive decomposition to integrate MMVs for LTSF. Comparative evaluations against 14 SOTA models across diverse datasets show that DMMV outperforms single-view and existing multi-modal baselines, achieving the best mean squared error (MSE) on 6 out of 8 benchmark datasets. The code for this paper is available at: https://github.com/D2I-Group/dmmv.
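
As background for the decomposition that DMMV builds on, here is a minimal numpy sketch of the classic moving-average trend-seasonal split; the combination with image/text views and the backcast-residual mechanism are DMMV's contributions and are not reproduced here. The window size is an arbitrary assumption.

```python
import numpy as np

def decompose(series: np.ndarray, kernel: int = 25):
    """Split a 1-D series into a smooth trend (moving average) and a
    seasonal/residual component; `kernel` is a hypothetical window size."""
    pad = kernel // 2
    padded = np.pad(series, (pad, pad), mode="edge")
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    seasonal = series - trend
    return trend, seasonal

t = np.arange(512)
x = 0.01 * t + np.sin(2 * np.pi * t / 24)   # toy trend + daily period
trend, seasonal = decompose(x)
print(trend.shape, seasonal.shape)          # (512,) (512,)
```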


GATES: Cost-aware Dynamic Workflow Scheduling via Graph Attention Networks and Evolution Strategy

Shen, Ya, Chen, Gang, Ma, Hui, Zhang, Mengjie

arXiv.org Artificial Intelligence

Cost-aware Dynamic Workflow Scheduling (CADWS) is a key challenge in cloud computing, focusing on devising an effective scheduling policy to efficiently schedule dynamically arriving workflow tasks, represented as Directed Acyclic Graphs (DAGs), to suitable virtual machines (VMs). Deep reinforcement learning (DRL) has been widely employed for automated scheduling policy design. However, the performance of DRL is heavily influenced by the design of the problem-tailored policy network and is highly sensitive to hyperparameters and the design of reward feedback. Considering these issues, this study proposes a novel DRL method that combines a Graph Attention Network-based policy network with an Evolution Strategy, referred to as GATES. The contributions of GATES are summarized as follows: (1) GATES can capture the impact of current task scheduling on subsequent tasks by learning the topological relationships between tasks in a DAG. (2) GATES can assess the importance of each VM to the ready task, enabling it to adapt to dynamically changing VM resources. (3) Utilizing the Evolution Strategy's robustness, exploratory nature, and tolerance for delayed rewards, GATES achieves stable policy learning in CADWS. Extensive experimental results demonstrate the superiority of GATES in CADWS, outperforming several state-of-the-art algorithms. The source code is available at: https://github.com/YaShen998/GATES.
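
To illustrate the kind of graph-attention update such a policy network applies over a task DAG, here is a minimal single-head numpy sketch; the weights, dimensions, and toy chain DAG are random placeholders, not GATES's trained policy (see the linked repository for that).

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks, d = 5, 8
H = rng.normal(size=(n_tasks, d))             # task feature vectors
adj = np.eye(n_tasks, k=1) + np.eye(n_tasks)  # toy chain DAG + self-loops
W = rng.normal(size=(d, d))
a = rng.normal(size=(2 * d,))

Z = H @ W
scores = np.full((n_tasks, n_tasks), -np.inf)
for i in range(n_tasks):
    for j in range(n_tasks):
        if adj[i, j]:
            e = np.concatenate([Z[i], Z[j]]) @ a
            scores[i, j] = np.maximum(0.2 * e, e)   # LeakyReLU
# Row-wise softmax over neighbors (masked entries contribute exp(-inf)=0).
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)
H_next = alpha @ Z                            # attention-weighted aggregation
print(H_next.shape)                           # (5, 8)
```

Each task's new embedding mixes in its DAG successors weighted by learned attention, which is how the policy can account for a scheduling decision's impact on downstream tasks.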


KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control

Han, Jinrui, Xie, Weiji, Zheng, Jiakun, Shi, Jiyuan, Zhang, Weinan, Xiao, Ting, Bai, Chenjia

arXiv.org Artificial Intelligence

Learning versatile whole-body skills by tracking various human motions is a fundamental step toward general-purpose humanoid robots. This task is particularly challenging because a single policy must master a broad repertoire of motion skills while ensuring stability over long-horizon sequences. To this end, we present VMS, a unified whole-body controller that enables humanoid robots to learn diverse and dynamic behaviors within a single policy. Our framework integrates a hybrid tracking objective that balances local motion fidelity with global trajectory consistency, and an Orthogonal Mixture-of-Experts (OMoE) architecture that encourages skill specialization while enhancing generalization across motions. A segment-level tracking reward is further introduced to relax rigid step-wise matching, enhancing robustness when handling global displacements and transient inaccuracies. We deploy VMS on the Unitree G1 humanoid robot, demonstrating its capability to perform a broad range of motion skills with strong stability and generalization; the repertoire includes (a) walking and running, (b) ball throwing and racket swinging, (c) dancing, (d) diverse kicking, (e) Kung Fu, and (f) long sequences of martial arts and dance.
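
As a loose illustration of the segment-level tracking reward described above, here is a minimal numpy sketch: it scores mean pose error per segment rather than per step, so transient inaccuracies within a segment are tolerated. The reward form, segment length, temperature, and joint dimensionality are assumptions, not the paper's exact formulation.

```python
import numpy as np

def segment_reward(qpos, ref, seg_len=10, sigma=0.5):
    """Exponential reward on mean pose error per segment (hypothetical form)."""
    errors = np.linalg.norm(qpos - ref, axis=-1)        # per-step joint error
    n_seg = len(errors) // seg_len
    seg_err = errors[: n_seg * seg_len].reshape(n_seg, seg_len).mean(axis=1)
    return np.exp(-seg_err / sigma).mean()

T, n_joints = 100, 29
rng = np.random.default_rng(1)
ref = rng.normal(size=(T, n_joints))                    # reference motion
qpos = ref + 0.05 * rng.normal(size=(T, n_joints))      # tracked motion
print(f"segment reward: {segment_reward(qpos, ref):.3f}")
```

Averaging before exponentiation means a single bad step inside an otherwise well-tracked segment is not punished as harshly as a step-wise reward would punish it.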


From Images to Signals: Are Large Vision Models Useful for Time Series Analysis?

Zhao, Ziming, Shen, ChengAo, Tong, Hanghang, Song, Dongjin, Deng, Zhigang, Wen, Qingsong, Ni, Jingchao

arXiv.org Artificial Intelligence

Transformer-based models have gained increasing attention in time series research, driving interest in Large Language Models (LLMs) and foundation models for time series analysis. As the field moves toward multi-modality, Large Vision Models (LVMs) are emerging as a promising direction. In the past, the effectiveness of Transformers and LLMs in time series has been debated. When it comes to LVMs, a similar question arises: are LVMs truly useful for time series analysis? To address this question, we design and conduct the first principled study involving 4 LVMs, 8 imaging methods, 18 datasets and 26 baselines across both high-level (classification) and low-level (forecasting) tasks, with extensive ablation analysis. Our findings indicate LVMs are indeed useful for time series classification but face challenges in forecasting. Although effective, the best contemporary LVM forecasters are limited to specific types of LVMs and imaging methods, exhibit a bias toward forecasting periods, and have limited ability to utilize long look-back windows. We hope our findings can serve as a cornerstone for future research on LVM- and multimodal-based solutions to different time series tasks.
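
One widely used way to render a series as an image for an LVM is the Gramian Angular Field (GAF); whether it is among this paper's 8 imaging methods is not stated here, but the minimal numpy sketch below shows the flavor of such a transformation.

```python
import numpy as np

def gramian_angular_field(x: np.ndarray) -> np.ndarray:
    """Map a 1-D series to a 2-D GAF image via polar-angle sums (GASF variant)."""
    x_min, x_max = x.min(), x.max()
    x_scaled = 2 * (x - x_min) / (x_max - x_min) - 1     # rescale to [-1, 1]
    phi = np.arccos(np.clip(x_scaled, -1, 1))            # polar angles
    return np.cos(phi[:, None] + phi[None, :])           # pairwise angle sums

t = np.linspace(0, 4 * np.pi, 64)
img = gramian_angular_field(np.sin(t))
print(img.shape)    # (64, 64) image, ready for a vision backbone
```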


LVM4CSI: Enabling Direct Application of Pre-Trained Large Vision Models for Wireless Channel Tasks

Guo, Jiajia, Jiang, Peiwen, Wen, Chao-Kai, Jin, Shi, Zhang, Jun

arXiv.org Artificial Intelligence

Accurate channel state information (CSI) is critical to the performance of wireless communication systems, especially with the increasing scale and complexity introduced by 5G and future 6G technologies. While artificial intelligence (AI) offers a promising approach to CSI acquisition and utilization, existing methods largely depend on task-specific neural networks (NNs) that require expert-driven design and large training datasets, limiting their generalizability and practicality. To address these challenges, we propose LVM4CSI, a general and efficient framework that leverages the structural similarity between CSI and computer vision (CV) data to directly apply large vision models (LVMs) pre-trained on extensive CV datasets to wireless tasks without any fine-tuning, in contrast to large language model-based methods that generally necessitate fine-tuning. LVM4CSI maps CSI tasks to analogous CV tasks, transforms complex-valued CSI into visual formats compatible with LVMs, and integrates lightweight trainable layers to adapt extracted features to specific communication objectives. We validate LVM4CSI through three representative case studies, including channel estimation, human activity recognition, and user localization. Results demonstrate that LVM4CSI achieves comparable or superior performance to task-specific NNs, including an improvement exceeding 9.61 dB in channel estimation and an approximately 40% reduction in localization error. Furthermore, it significantly reduces the number of trainable parameters and eliminates the need for task-specific NN design.
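
A minimal sketch of the kind of CSI-to-image mapping such a framework relies on: a complex channel matrix becomes a multi-channel "image" an LVM can ingest. The channel dimensions and min-max normalization below are placeholders, not LVM4CSI's exact transformation.

```python
import numpy as np

def csi_to_image(H: np.ndarray) -> np.ndarray:
    """Stack real/imag parts of complex CSI as two image channels in [0, 1]."""
    img = np.stack([H.real, H.imag], axis=0)             # (2, ant, subcarrier)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-9)

rng = np.random.default_rng(0)
H = rng.normal(size=(32, 64)) + 1j * rng.normal(size=(32, 64))  # toy CSI
print(csi_to_image(H).shape)   # (2, 32, 64)
```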


Towards VM Rescheduling Optimization Through Deep Reinforcement Learning

Ding, Xianzhong, Zhang, Yunkai, Chen, Binbin, Ying, Donghao, Zhang, Tieying, Chen, Jianjun, Zhang, Lei, Cerpa, Alberto, Du, Wan

arXiv.org Artificial Intelligence

Modern industry-scale data centers need to manage a large number of virtual machines (VMs). Due to the continual creation and release of VMs, many small resource fragments are scattered across physical machines (PMs). To handle these fragments, data centers periodically reschedule some VMs to alternative PMs, a practice commonly referred to as VM rescheduling. Despite the increasing importance of VM rescheduling as data centers grow in size, the problem remains understudied. We first show that, unlike most combinatorial optimization tasks, the inference time of VM rescheduling algorithms significantly influences their performance, due to dynamic VM state changes during this period. This causes existing methods to scale poorly. Therefore, we develop a reinforcement learning system for VM rescheduling, VM2RL, which incorporates a set of customized techniques: a two-stage framework that accommodates diverse constraints and workload conditions, a feature extraction module that captures relational information specific to rescheduling, and a risk-seeking evaluation that enables users to optimize the trade-off between latency and accuracy. We conduct extensive experiments with data from an industry-scale data center. Our results show that VM2RL can achieve performance comparable to the optimal solution but with a running time of seconds. Code and datasets are open-sourced: https://github.com/zhykoties/VMR2L_eurosys, https://drive.google.com/drive/folders/1PfRo1cVwuhH30XhsE2Np3xqJn2GpX5qy.
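
For intuition about the underlying problem, here is a toy greedy rescheduler (VM2RL itself is a learned, two-stage policy; see the linked repository). It repeatedly moves a VM to the feasible PM with the least remaining slack (best fit), consolidating fragments. All capacities and demands are made-up numbers.

```python
def greedy_reschedule(pms, vms, placement, moves=3):
    """pms: pm -> capacity; vms: vm -> demand; placement: vm -> pm.
    Perform up to `moves` best-fit migrations to reduce fragmentation."""
    for _ in range(moves):
        free = {p: c - sum(vms[v] for v, q in placement.items() if q == p)
                for p, c in pms.items()}
        best = None
        for v, p in placement.items():
            for q in pms:
                if q != p and vms[v] <= free[q]:
                    slack = free[q] - vms[v]          # tightest fit wins
                    if best is None or slack < best[0]:
                        best = (slack, v, q)
        if best is None:
            break
        _, v, q = best
        placement[v] = q
    return placement

pms = {"pm1": 16, "pm2": 16}
vms = {"a": 4, "b": 4, "c": 8}
print(greedy_reschedule(pms, vms, {"a": "pm1", "b": "pm2", "c": "pm1"}))
```

Because the cluster state keeps changing while a rescheduler deliberates, even a good plan computed slowly can be stale on arrival, which is the latency-accuracy trade-off the paper highlights.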


Symmetry-Preserving Architecture for Multi-NUMA Environments (SPANE): A Deep Reinforcement Learning Approach for Dynamic VM Scheduling

Chan, Tin Ping, Cheng, Yunlong, Zhu, Yizhan, Gao, Xiaofeng, Chen, Guihai

arXiv.org Artificial Intelligence

As cloud computing continues to evolve, the adoption of multi-NUMA (Non-Uniform Memory Access) architecture by cloud service providers has introduced new challenges in virtual machine (VM) scheduling. To address these challenges and more accurately reflect the complexities faced by modern cloud environments, we introduce the Dynamic VM Allocation problem in Multi-NUMA PM (DVAMP). We formally define both offline and online versions of DVAMP as mixed-integer linear programming problems, providing a rigorous mathematical foundation for analysis. A tight performance bound for greedy online algorithms is derived, offering insights into the worst-case optimality gap as a function of the number of physical machines and VM lifetime variability. To address the challenges posed by DVAMP, we propose SPANE (Symmetry-Preserving Architecture for Multi-NUMA Environments), a novel deep reinforcement learning approach that exploits the problem's inherent symmetries. SPANE produces invariant results under arbitrary permutations of physical machine states, enhancing learning efficiency and solution quality. Extensive experiments conducted on the Huawei-East-1 dataset demonstrate that SPANE outperforms existing baselines, reducing average VM wait time by 45%. Our work contributes to the field of cloud resource management by providing both theoretical insights and practical solutions for VM scheduling in multi-NUMA environments, addressing a critical gap in the literature and offering improved performance for real-world cloud systems.
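
A minimal sketch of the permutation property SPANE targets: scoring every PM state with shared weights, so permuting the PM inputs permutes the output scores identically (equivariance). The network weights and feature sizes below are random placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(6, 16)), rng.normal(size=(16, 1))

def score_pms(pm_states: np.ndarray) -> np.ndarray:
    """Apply the same small MLP to every PM state row; one score per PM."""
    return (np.tanh(pm_states @ W1) @ W2).ravel()

pms = rng.normal(size=(4, 6))               # 4 PMs, 6 features each
perm = rng.permutation(4)
# Permuting inputs permutes outputs the same way: no PM ordering is learned.
assert np.allclose(score_pms(pms)[perm], score_pms(pms[perm]))
print(score_pms(pms))
```

Weight sharing across PM slots means the policy cannot overfit to an arbitrary machine ordering, which is one reason symmetry-preserving architectures tend to learn more efficiently.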


External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation

Liang, Mingfu, Liu, Xi, Jin, Rong, Liu, Boyang, Suo, Qiuling, Zhou, Qinghai, Zhou, Song, Chen, Laming, Zheng, Hua, Li, Zhiyuan, Jiang, Shali, Yang, Jiyan, Xia, Xiaozhen, Yang, Fan, Badr, Yasmine, Wen, Ellie, Xu, Shuyu, Chen, Hansey, Zhang, Zhengyu, Nie, Jade, Yang, Chunzhi, Zeng, Zhichen, Zhang, Weilin, Huang, Xingliang, Li, Qianru, Wang, Shiquan, Lyu, Evelyn, Lu, Wenjing, Zhang, Rui, Wang, Wenjun, Rudy, Jason, Hang, Mengyue, Wang, Kai, Ma, Yinbin, Wang, Shuaiwen, Zeng, Sihan, Tang, Tongyi, Wei, Xiaohan, Jin, Longhao, Zhang, Jamey, Chen, Marcus, Zhang, Jiayi, Huang, Angie, Zhang, Chi, Zhao, Zhengli, Yang, Jared, Jin, Qiang, Chen, Xian, Amlesahwaram, Amit Anand, Song, Lexi, Luo, Liang, Hao, Yuchen, Xiao, Nan, Yetim, Yavuz, Pan, Luoshang, Liu, Gaoxiang, Hu, Yuxi, Huang, Yuzhen, Xu, Jackie, Zhu, Rich, Zhang, Xin, Liu, Yiqun, Yin, Hang, Chen, Yuxin, Zhang, Buyun, Liu, Xiaoyi, Wang, Xingyuan, Mao, Wenguang, Li, Zhijing, Huang, Qin, Sun, Chonglin, Yu, Nancy, Gu, Shuo, Mao, Shupin, Au, Benjamin, Qin, Jingzheng, Yao, Peggy, Choi, Jae-Woo, Gao, Bin, Wang, Ernest, Zhang, Lei, Chen, Wen-Yen, Lee, Ted, Zha, Jay, Meng, Yi, Gong, Alex, Gao, Edison, Vahdatpour, Alireza, Han, Yiping, Yao, Yantao, Kureha, Toshinari, Chang, Shuo, Sultan, Musharaf, Bocharov, John, Chordia, Sagar, Gan, Xiaorui, Sun, Peng, Liu, Rocky, Long, Bo, Chen, Wenlin, Kolay, Santanu, Li, Huayu

arXiv.org Artificial Intelligence

Ads recommendation is a prominent service of online advertising systems and has been actively studied. Recent studies indicate that scaling up and advanced design of the recommendation model can bring significant performance improvement. However, at larger model scales, such prior studies diverge increasingly from industrial practice, as they often neglect two fundamental challenges in industrial-scale applications. First, training and inference budgets for a served model are restricted; exceeding them incurs latency and impairs user experience. Second, large-volume data arrive in a streaming mode with data distributions dynamically shifting, as new users/ads join and existing users/ads leave the system. We propose the External Large Foundation Model (ExFM) framework to address these overlooked challenges. Specifically, we develop external distillation and a data augmentation system (DAS) to control the computational cost of training/inference while maintaining high performance. We design the teacher as a foundation model (FM) that can serve multiple students as vertical models (VMs), amortizing its building cost. We propose an Auxiliary Head and a Student Adapter to mitigate the data distribution gap between the FM and VMs caused by the streaming data issue. Comprehensive experiments on internal industrial-scale applications and public datasets demonstrate significant performance gains from ExFM.
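
To make the external-distillation idea concrete, here is a minimal sketch: a frozen teacher's predictions are logged once, and each vertical model (student) later fits them alongside the ground-truth labels. The loss form and weighting are illustrative assumptions, not ExFM's exact objective.

```python
import numpy as np

def distill_loss(student_logit, teacher_logit, label, alpha=0.5):
    """Binary cross-entropy vs. labels plus BCE vs. teacher soft targets."""
    def bce(logit, target):
        p = 1.0 / (1.0 + np.exp(-logit))
        return -(target * np.log(p + 1e-9)
                 + (1 - target) * np.log(1 - p + 1e-9)).mean()
    teacher_p = 1.0 / (1.0 + np.exp(-teacher_logit))   # teacher soft target
    return alpha * bce(student_logit, label) + (1 - alpha) * bce(student_logit, teacher_p)

rng = np.random.default_rng(0)
s, t = rng.normal(size=64), rng.normal(size=64)        # student/teacher logits
y = (rng.random(64) > 0.5).astype(float)               # click labels
print(f"loss: {distill_loss(s, t, y):.3f}")
```

Because the teacher runs offline and its outputs are reused by many students, its cost is amortized and no teacher forward pass is needed at student serving time.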


TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

Stojkovic, Jovan, Zhang, Chaojie, Goiri, Íñigo, Choukse, Esha, Qiu, Haoran, Fonseca, Rodrigo, Torrellas, Josep, Bianchini, Ricardo

arXiv.org Artificial Intelligence

The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques are often inadequate for LLM inference due to the fine-grained, millisecond-scale execution phases, each with distinct performance, thermal, and power profiles. Additionally, LLM inference workloads are sensitive to various configuration parameters (e.g., model parallelism, size, and quantization) that involve trade-offs between performance, temperature, power, and output quality. Moreover, clouds often co-locate SaaS and IaaS workloads, each with different levels of visibility and flexibility. We propose TAPAS, a thermal- and power-aware framework designed for LLM inference clusters in the cloud. TAPAS enhances cooling and power oversubscription capabilities, reducing the total cost of ownership (TCO) while effectively handling emergencies (e.g., cooling and power failures). The system leverages historical temperature and power data, along with the adaptability of SaaS workloads, to: (1) efficiently place new GPU workload VMs within cooling and power constraints, (2) route LLM inference requests across SaaS VMs, and (3) reconfigure SaaS VMs to manage load spikes and emergency situations. Our evaluation on a large GPU cluster demonstrates significant reductions in thermal and power throttling events, boosting system efficiency.
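
A toy headroom-based placement rule in the spirit of constraint (1) above: pick the rack that keeps the most combined power and thermal headroom after hosting the new VM. The real system uses historical telemetry and SaaS reconfiguration; the caps and loads below are made-up numbers.

```python
def place_vm(racks, vm_power, vm_heat):
    """Return the rack with the largest post-placement headroom, or None
    if no rack can host the VM without violating a power/thermal cap."""
    best, best_headroom = None, -1.0
    for name, r in racks.items():
        p_room = r["power_cap"] - r["power"] - vm_power
        t_room = r["temp_cap"] - r["temp"] - vm_heat
        if p_room < 0 or t_room < 0:
            continue                      # would violate a cap
        headroom = min(p_room / r["power_cap"], t_room / r["temp_cap"])
        if headroom > best_headroom:
            best, best_headroom = name, headroom
    return best

racks = {
    "r1": {"power": 70, "power_cap": 100, "temp": 28, "temp_cap": 35},
    "r2": {"power": 40, "power_cap": 100, "temp": 31, "temp_cap": 35},
}
print(place_vm(racks, vm_power=20, vm_heat=2))   # -> "r1"
```

Taking the minimum of the two normalized margins makes whichever resource is scarcer (power or cooling) the binding constraint, mirroring how oversubscribed clusters must respect both caps at once.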