Chen, Ang
Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents
Kon, Patrick Tser Jern, Liu, Jiachen, Ding, Qiuyi, Qiu, Yiming, Yang, Zhenning, Huang, Yibo, Srinivasa, Jayanth, Lee, Myungjin, Chowdhury, Mosharaf, Chen, Ang
Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4$\times$ improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.
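To make the three-component structure concrete, here is a minimal, hypothetical Python sketch of what intra-agent rigor (validating a single plan), inter-agent rigor (keeping concurrent plans controlled), and an experiment knowledge log might look like; all names and checks are illustrative assumptions, not taken from the Curie codebase.

```python
# Hypothetical sketch of the three modules described in the abstract;
# names and interfaces are illustrative, not Curie's actual code.
from dataclasses import dataclass, field


@dataclass
class ExperimentKnowledge:
    """Experiment knowledge module: records what was run and why,
    so results stay interpretable and reproducible."""
    log: list = field(default_factory=list)

    def record(self, stage: str, detail: str) -> None:
        self.log.append((stage, detail))


def intra_agent_rigor(plan: dict) -> bool:
    """Intra-agent rigor: a single agent's plan must name a control
    condition and a measurable metric before it is executed."""
    return bool(plan.get("control")) and bool(plan.get("metric"))


def inter_agent_rigor(plans: list) -> bool:
    """Inter-agent rigor: concurrent plans may differ only in the
    variable under study; everything else must match."""
    fixed = [{k: v for k, v in p.items() if k != "variable"} for p in plans]
    return all(f == fixed[0] for f in fixed)


def run_experiments(plans: list, knowledge: ExperimentKnowledge) -> None:
    if not all(intra_agent_rigor(p) for p in plans):
        raise ValueError("a plan is missing its control or metric")
    if not inter_agent_rigor(plans):
        raise ValueError("plans differ in more than the studied variable")
    for p in plans:
        knowledge.record("run", f"varied {p['variable']} against control {p['control']}")


kb = ExperimentKnowledge()
run_experiments(
    [{"control": "baseline", "metric": "latency", "variable": "batch=8"},
     {"control": "baseline", "metric": "latency", "variable": "batch=16"}],
    kb,
)
print(kb.log)
```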
Disaggregating Embedding Recommendation Systems with FlexEMR
Huang, Yibo, Yang, Zhenning, Xing, Jiarong, Dai, Yi, Qiu, Yiming, Wu, Dingming, Lai, Fan, Chen, Ang
Efficiently serving embedding-based recommendation (EMR) models remains a significant challenge due to their increasingly large memory requirements. Today's practice splits the model across many monolithic servers, where a mix of GPUs, CPUs, and DRAM is provisioned in fixed proportions. This approach leads to suboptimal resource utilization and increased costs. Disaggregating embedding operations from neural network inference is a promising solution but raises novel networking challenges. In this paper, we discuss the design of FlexEMR for optimized EMR disaggregation. FlexEMR proposes two sets of techniques to tackle the networking challenges: leveraging the temporal and spatial locality of embedding lookups to reduce data movement over the network, and designing an optimized multi-threaded RDMA engine for concurrent lookup subrequests. We outline the design space for each technique and present initial results from our early prototype.
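The first technique family above, exploiting the temporal locality of embedding lookups, can be illustrated with a small local cache that serves hot embedding rows without crossing the network. The sketch below is hypothetical: the class, capacity, and in-process table standing in for a remote memory node are assumptions for illustration, not FlexEMR's implementation (which targets RDMA).

```python
# Hypothetical sketch of lookup locality: a small local LRU cache keeps hot
# embedding rows so repeated lookups avoid a remote fetch. Sizes and names
# are illustrative, not FlexEMR's actual design.
from collections import OrderedDict

import numpy as np

EMB_DIM = 16
REMOTE_TABLE = np.random.rand(10_000, EMB_DIM).astype(np.float32)  # stand-in for a remote memory node


class LocalEmbeddingCache:
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.cache: "OrderedDict[int, np.ndarray]" = OrderedDict()
        self.remote_fetches = 0

    def _fetch_remote(self, idx: int) -> np.ndarray:
        # In a disaggregated design this would be an RDMA read to the memory node.
        self.remote_fetches += 1
        return REMOTE_TABLE[idx]

    def lookup(self, idx: int) -> np.ndarray:
        if idx in self.cache:                 # temporal locality: hot row served locally
            self.cache.move_to_end(idx)
            return self.cache[idx]
        row = self._fetch_remote(idx)
        self.cache[idx] = row
        if len(self.cache) > self.capacity:   # evict the least-recently-used row
            self.cache.popitem(last=False)
        return row


cache = LocalEmbeddingCache(capacity=256)
# Skewed access pattern, as is typical for recommendation workloads.
ids = np.random.zipf(1.3, size=5_000) % 10_000
for i in ids:
    cache.lookup(int(i))
print(f"remote fetches: {cache.remote_fetches} / {len(ids)} lookups")
```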
Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance
Xing, Jiarong, Wang, Leyuan, Zhang, Shang, Chen, Jack, Chen, Ang, Zhu, Yibo
Today's auto-tuners (e.g., AutoTVM, Ansor) generate efficient tensor programs by navigating a large search space to identify effective implementations, but they do so with opaque hardware details. Thus, their performance could fall behind that of hardware-native libraries (e.g., cuBLAS, cuDNN), which are hand-optimized by device vendors to extract high performance. On the other hand, these vendor libraries have a fixed set of supported functions and lack the customization and automation support afforded by auto-tuners. Bolt is based on the recent trend that vendor libraries are increasingly modularized and reconfigurable via declarative control (e.g., CUTLASS). It enables a novel approach that bridges this gap and achieves the best of both worlds, via hardware-native templated search. Bolt provides new opportunities to rethink end-to-end tensor optimizations at the graph, operator, and model levels. Bolt demonstrates this concept by prototyping on a popular auto-tuner in TVM and a class of widely-used platforms (i.e., NVIDIA GPUs), both in large deployment in our production environment. Bolt improves the inference speed of common convolutional neural networks by 2.5x on average over the state of the art, and it auto-tunes these models within 20 minutes.

Example auto-tuners like AutoTVM (Chen et al., 2018b) and Ansor (Zheng et al., 2020a) infer hardware cost models from afar, by executing sample implementations. Building on the inferred cost models, auto-tuners take tensor programs as inputs and navigate a large search space to select effective transformations for high performance. Ansor (Zheng et al., 2020a) only achieves 20% of cuBLAS performance for FP16 GEMMs on NVIDIA Tesla T4 GPUs (see Figure 1 for more details). Relatedly, opaque device models also lead to a prolonged auto-tuning time, as the search process is less informed by …
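The core idea named in the abstract, hardware-native templated search, amounts to enumerating only the parameter grid that a modular vendor template library (e.g., CUTLASS) actually exposes and picking the fastest instantiation by measurement, instead of exploring an open-ended schedule space. Below is a minimal, illustrative Python sketch of that search pattern; the tiled GEMM stand-in and parameter names are assumptions for illustration, not Bolt's actual implementation.

```python
# Hypothetical sketch of templated search: enumerate a small, template-defined
# configuration grid, benchmark each candidate, and keep the fastest one.
import itertools
import time

import numpy as np


def blocked_gemm(a: np.ndarray, b: np.ndarray, tile_m: int, tile_n: int) -> np.ndarray:
    """Toy stand-in for a templated GEMM kernel parameterized by tile sizes."""
    m, _ = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            out[i:i + tile_m, j:j + tile_n] = a[i:i + tile_m] @ b[:, j:j + tile_n]
    return out


def templated_search(a, b, tile_ms=(32, 64, 128), tile_ns=(32, 64, 128)):
    """Measure only the configurations the template exposes; return the best."""
    best = None
    for tm, tn in itertools.product(tile_ms, tile_ns):
        start = time.perf_counter()
        blocked_gemm(a, b, tm, tn)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best[0]:
            best = (elapsed, tm, tn)
    return best


a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)
elapsed, tm, tn = templated_search(a, b)
print(f"best tile: {tm}x{tn}, {elapsed * 1e3:.2f} ms")
```

Because the search space is restricted to configurations the hardware-native templates already implement well, every candidate runs at vendor-library quality, which is why this style of search can both close the performance gap and finish tuning quickly.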