Collaborating Authors

 Bai, Xiaolong


Efficiently serving large multimedia models using EPD Disaggregation

arXiv.org Artificial Intelligence

Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This stage converts raw inputs into tokenized representations that lengthen the token sequence for the prefill phase, negatively impacting key Service Level Objectives (SLOs) such as time to first token (TTFT) and end-to-end throughput. We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our disaggregation approach alleviates memory bottlenecks, mitigates synchronization delays, and supports flexible batching. Specifically, we employ a new caching mechanism for multimodal tokens that enables their asynchronous transfer, and we introduce an integrated module that finds the optimal configuration for the EPD system, minimizing resource usage while maximizing an SLO-based performance metric. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15$\times$ lower memory use on encoding-stage GPUs), which supports up to 22$\times$ larger batch sizes, 10$\times$ more images per request, and a 2.2$\times$ larger KV cache. Further, it leads to significant improvements in end-to-end throughput (up to 57% better) and latency metrics (TTFT up to 71% lower) compared to systems that do not disaggregate. Our findings underscore the potential of EPD disaggregation to enable resource-efficient and high-performance multimodal inference at scale.
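
As a rough illustration of the stage separation described in this abstract, the following is a minimal sketch in Python, not the authors' system. It assumes one worker thread per stage as a stand-in for a dedicated GPU pool, and plain queues as a stand-in for the asynchronous token transfer. The functions `encode`, `prefill`, `decode`, `stage_worker`, and `run_pipeline`, as well as the figure of four tokens per image, are hypothetical names and values chosen for the example.

```python
import queue
import threading

# Hypothetical stand-ins for the three stages; none of these names come from the paper.
def encode(request):
    # Assumption for illustration: every image becomes 4 multimodal tokens.
    tokens = ["img%d_tok%d" % (i, j)
              for i in range(request["num_images"]) for j in range(4)]
    return {"id": request["id"], "prompt": request["prompt"], "mm_tokens": tokens}

def prefill(item):
    # Prefill sees text tokens plus multimodal tokens; the "KV cache" is just a length here.
    sequence = item["prompt"].split() + item["mm_tokens"]
    return {"id": item["id"], "kv_len": len(sequence)}

def decode(item, max_new_tokens=3):
    # Emit placeholder output tokens.
    output = ["tok%d" % k for k in range(max_new_tokens)]
    return {"id": item["id"], "kv_len": item["kv_len"], "output": output}

def stage_worker(fn, inbox, outbox):
    # Each stage runs on its own worker and talks to its neighbours only through queues.
    while True:
        item = inbox.get()
        if item is None:        # sentinel: pass shutdown on to the next stage
            outbox.put(None)
            return
        outbox.put(fn(item))

def run_pipeline(requests):
    q_enc, q_pre, q_dec, q_out = (queue.Queue() for _ in range(4))
    workers = [
        threading.Thread(target=stage_worker, args=(encode, q_enc, q_pre)),
        threading.Thread(target=stage_worker, args=(prefill, q_pre, q_dec)),
        threading.Thread(target=stage_worker, args=(decode, q_dec, q_out)),
    ]
    for w in workers:
        w.start()
    for r in requests:
        q_enc.put(r)
    q_enc.put(None)
    results = []
    while True:
        item = q_out.get()
        if item is None:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results

if __name__ == "__main__":
    demo = [{"id": 0, "prompt": "describe this image", "num_images": 2},
            {"id": 1, "prompt": "hi", "num_images": 0}]
    for result in run_pipeline(demo):
        print(result)
```

Because the stages share nothing but queues, the encoder can start on the next request while prefill and decode are still busy with earlier ones. That overlap is the property the sketch is meant to show; it does not model memory use, batching, or the configuration search.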


Look Before You Leap: Safe Model-Based Reinforcement Learning with Human Intervention

arXiv.org Artificial Intelligence

Safety has become one of the main challenges of applying deep reinforcement learning to real-world systems. Currently, the incorporation of external knowledge such as human oversight is the only means to prevent the agent from visiting catastrophic states. In this paper, we propose MBHI, a novel framework for safe model-based reinforcement learning, which ensures safety at the state level and can effectively avoid both "local" and "non-local" catastrophes. An ensemble of supervised learners is trained in MBHI to imitate human blocking decisions. Similar to the human decision-making process, MBHI rolls out an imagined trajectory in the dynamics model before executing actions in the environment, and estimates its safety. When the imagined trajectory encounters a catastrophe, MBHI blocks the current action and uses an efficient MPC method to output a safe policy. We evaluate our method on several safety tasks, and the results show that MBHI achieves better performance in terms of sample efficiency and number of catastrophes compared to the baselines.
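
The look-ahead-and-block loop in this abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not MBHI itself: the environment is a 1-D line, the "learned" dynamics model is the true transition `state + action`, the blocker ensemble is three hand-written thresholds combined by majority vote, and exhaustive enumeration of action sequences stands in for the paper's efficient MPC method. All names (`dynamics_model`, `BLOCKERS`, `is_catastrophe`, `imagined_unsafe`, `safe_action_mpc`, `act`) are hypothetical.

```python
from itertools import product

ACTIONS = (-1, 0, 1)

def dynamics_model(state, action):
    # Assumption: stands in for a learned model; here it is the true 1-D dynamics.
    return state + action

# Toy "ensemble" of blockers: each one flags positions at or past its own threshold.
BLOCKERS = [lambda s, t=t: s >= t for t in (4, 5, 6)]

def is_catastrophe(state):
    # Majority vote over the ensemble.
    votes = sum(1 for blocker in BLOCKERS if blocker(state))
    return votes * 2 > len(BLOCKERS)

def imagined_unsafe(state, first_action, policy, horizon):
    # Roll out `horizon` imagined steps: the proposed action first, then the policy.
    s = dynamics_model(state, first_action)
    for _ in range(horizon - 1):
        if is_catastrophe(s):
            return True
        s = dynamics_model(s, policy(s))
    return is_catastrophe(s)

def safe_action_mpc(state, horizon):
    # Brute-force stand-in for MPC: return the first action of the first
    # action sequence whose imagined states are all non-catastrophic.
    for seq in product(ACTIONS, repeat=horizon):
        s, safe = state, True
        for a in seq:
            s = dynamics_model(s, a)
            if is_catastrophe(s):
                safe = False
                break
        if safe:
            return seq[0]
    return None  # no safe sequence exists within the horizon

def act(state, policy, horizon=3):
    proposed = policy(state)
    if imagined_unsafe(state, proposed, policy, horizon):
        return safe_action_mpc(state, horizon)  # block and replace the action
    return proposed

if __name__ == "__main__":
    always_right = lambda s: 1
    for start in (0, 2, 3):
        print(start, act(start, always_right))
```

With a policy that always moves right, the proposed action is executed from position 0 but replaced from positions 2 and 3, even though the immediate next state is still safe in both cases. That is the "non-local" case: the catastrophe shows up only later in the imagined rollout.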


SimROD: A Simple Adaptation Method for Robust Object Detection

arXiv.org Artificial Intelligence

This paper presents a Simple and effective unsupervised adaptation method for Robust Object Detection (SimROD). To overcome the challenging issues of domain shift and pseudo-label noise, our method integrates a novel domain-centric augmentation method, a gradual self-labeling adaptation procedure, and a teacher-guided fine-tuning mechanism. Using our method, target domain samples can be leveraged to adapt object detection models without changing the model architecture or generating synthetic data. When applied to image corruption and high-level cross-domain adaptation benchmarks, our method outperforms prior baselines. SimROD achieves a new state of the art on standard real-to-synthetic and cross-camera setup benchmarks. On the image corruption benchmark, models adapted with our method achieved a relative robustness improvement of 15-25% AP50 on Pascal-C and 5-6% AP on COCO-C and Cityscapes-C. On the cross-domain benchmark, our method outperformed the best baseline by up to 8% AP50 on the Comic dataset and up to 4% on the Watercolor dataset.