Twill: Scheduling Compound AI Systems on Heterogeneous Mobile Edge Platforms
Taufique, Zain, Vyas, Aman, Miele, Antonio, Liljeberg, Pasi, Kanduri, Anil
–arXiv.org Artificial Intelligence
Compound AI (cAI) systems chain multiple AI models to solve complex problems. Deploying cAI services on mobile edge platforms poses a significant challenge in scheduling concurrent DNN-transformer inference tasks, which arrive dynamically in an unknown sequence. Existing mobile edge AI inference strategies manage multi-DNN or transformer-only workloads, rely on design-time profiling, and cannot handle the concurrent inference of DNNs and transformers that cAI systems require. In this work, we address the challenge of scheduling cAI systems on heterogeneous mobile edge platforms. We present Twill, a run-time framework that handles concurrent inference requests of cAI workloads through task-affinity-aware cluster mapping and migration, priority-aware task freezing/unfreezing, and Dynamic Voltage/Frequency Scaling (DVFS), while minimizing inference latency within power budgets. We implement and deploy Twill on the Nvidia Jetson Orin NX platform. We evaluate Twill against state-of-the-art edge AI inference techniques on contemporary DNNs and LLMs, reducing inference latency by 54% on average while honoring power budgets.

AI applications are rapidly evolving from monolithic models towards Compound Artificial Intelligence (cAI) systems, which integrate multiple task-specific models and components to solve complex problems [1]-[3]. Emerging cAI systems combine Large Language Models (LLMs) with Deep Neural Networks (DNNs) to provide novel services such as conversational language agents [2]-[5], augmented and virtual reality (AR/VR) gear, and interactive autonomous vehicles [6]. In an exemplar cAI system, DNN models (D1: VGG-19 and D2: ResNet-152) perform image classification and object detection, transformer models (T1: BERT-base and T2: BERT-large) perform text summarization and classification, and generative transformers (T3: OPT-350M and LLM: DeepSeek-R1) perform reasoning and report generation.
Each model extracts key features from its input and forwards the output to subsequent models to perform collaborative tasks. T1, D1, and D2 are independent inference tasks that can run concurrently, while T2, T3, and the LLM depend on the outputs of other models. We deployed the exemplar cAI system on the Nvidia Jetson Orin NX platform.
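The dependency structure above can be sketched as a small task graph. The snippet below is an illustration, not the Twill implementation: it uses Python's standard-library `graphlib.TopologicalSorter` to compute which inference tasks are ready to be co-scheduled at each step. The specific dependency edges (T2 on T1, T3 on T2, the LLM on D1/D2/T3) are assumptions for illustration; the text only states that T1, D1, and D2 are independent while the remaining tasks consume other models' outputs.

```python
# Illustrative only: dependency-aware dispatch of the exemplar cAI task graph.
from graphlib import TopologicalSorter

# task -> set of tasks whose outputs it consumes (edges assumed for illustration)
deps = {
    "D1": set(), "D2": set(), "T1": set(),  # independent, can run concurrently
    "T2": {"T1"},                           # summarization/classification of T1's output
    "T3": {"T2"},                           # reasoning over the summarized text
    "LLM": {"D1", "D2", "T3"},              # report generation fuses all results
}

ts = TopologicalSorter(deps)
ts.prepare()
schedule = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose dependencies are all satisfied
    schedule.append(ready)          # each batch could be co-mapped onto clusters
    ts.done(*ready)

print(schedule)  # [['D1', 'D2', 'T1'], ['T2'], ['T3'], ['LLM']]
```

A run-time scheduler such as Twill would additionally decide, per batch, which heterogeneous cluster each ready task maps to and at what DVFS setting, rather than just the ordering shown here.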
Jul-2-2025