Shi, Xiaoxiang
TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
Lin, Chien-Yu, Kamahori, Keisuke, Liu, Yiyu, Shi, Xiaoxiang, Kashyap, Madhav, Gu, Yile, Shao, Rulin, Ye, Zihao, Zhu, Kan, Wang, Stephanie, Krishnamurthy, Arvind, Kadekodi, Rohan, Ceze, Luis, Kasikci, Baris
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments, especially when limited GPU memory is available. To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that anticipates required data and transfers it from CPU to GPU in parallel with LLM generation. By leveraging the modularity of RAG pipelines, the inverted file index (IVF) search algorithm, and similarities between queries, TeleRAG optimally overlaps data movement and computation. Experimental results show that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems, enabling faster, more memory-efficient deployments of advanced RAG applications.
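A minimal sketch of the lookahead-retrieval idea described in the abstract, assuming a toy NumPy IVF index and a Python thread standing in for asynchronous CPU-to-GPU transfer; the function names (predict_clusters, prefetch, search) are illustrative, not TeleRAG's API:

```python
import threading
import numpy as np

# Illustrative sketch only: overlap prefetching of IVF clusters predicted from
# a draft (lookahead) query with ongoing LLM generation.

def predict_clusters(query_vec, centroids, nprobe=4):
    """Rank IVF centroids by similarity to the (draft or final) query."""
    scores = centroids @ query_vec
    return np.argsort(-scores)[:nprobe]              # ids of clusters likely needed

def prefetch(cluster_ids, cpu_store, gpu_cache):
    """Copy predicted clusters from CPU memory into a (simulated) GPU cache."""
    for cid in cluster_ids:
        gpu_cache[cid] = cpu_store[cid]               # stands in for an async device copy

def search(query_vec, centroids, cpu_store, gpu_cache, nprobe=4):
    """Final retrieval: use cached clusters when present, fall back to CPU."""
    hits = []
    for cid in predict_clusters(query_vec, centroids, nprobe):
        vectors = gpu_cache.get(cid, cpu_store[cid])  # cache hit avoids a transfer
        hits.append((int(cid), float((vectors @ query_vec).max())))
    return sorted(hits, key=lambda h: -h[1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, nlist = 64, 32
    centroids = rng.normal(size=(nlist, dim))
    cpu_store = {i: rng.normal(size=(128, dim)) for i in range(nlist)}  # IVF lists on CPU
    gpu_cache = {}

    draft_q = rng.normal(size=dim)                    # lookahead query (e.g., from a partial prompt)
    final_q = draft_q + 0.05 * rng.normal(size=dim)   # final query is typically similar

    # Prefetch runs in parallel with (here, simulated) LLM generation.
    t = threading.Thread(target=prefetch,
                         args=(predict_clusters(draft_q, centroids), cpu_store, gpu_cache))
    t.start()
    # ... LLM generation would proceed here ...
    t.join()

    print(search(final_q, centroids, cpu_store, gpu_cache)[:3])
```

The sketch relies on the same property the abstract names: because the final query tends to resemble the lookahead query, most of the IVF clusters it needs are already resident by the time retrieval runs.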
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
Luo, Michael, Shi, Xiaoxiang, Cai, Colin, Zhang, Tianjun, Wong, Justin, Wang, Yichuan, Wang, Chi, Huang, Yanping, Chen, Zhifeng, Gonzalez, Joseph E., Stoica, Ion
Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. Our analysis reveals that programs submitted to LLM serving engines experience long cumulative wait times, primarily due to head-of-line blocking at both the individual LLM request and program levels. To address this, we introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies. Autellix intercepts LLM calls submitted by programs, enriching schedulers with program-level context. We propose two scheduling algorithms, one for single-threaded programs and one for distributed programs, that preempt and prioritize LLM calls based on their programs' previously completed calls. Our evaluation demonstrates that across diverse LLMs and agentic workloads, Autellix improves the throughput of programs by 4-15x at the same latency compared to state-of-the-art systems, such as vLLM.
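A minimal sketch of program-aware scheduling in the spirit of the abstract, assuming a single-node priority queue and using a program's attained service (e.g., tokens already generated) as its priority; the class and method names are hypothetical, not Autellix's interface:

```python
import heapq
from collections import defaultdict
from dataclasses import dataclass, field

# Illustrative sketch only: prioritize each LLM call by how much service its
# parent program has already received (a least-attained-service style policy),
# so new or short programs are not blocked behind long-running ones.

@dataclass(order=True)
class Call:
    priority: float                       # attained service of the parent program
    seq: int                              # tie-breaker preserving submission order
    program_id: str = field(compare=False)
    prompt: str = field(compare=False)

class ProgramAwareScheduler:
    def __init__(self):
        self.queue = []                   # min-heap: least-served program first
        self.served = defaultdict(float)  # program_id -> total service so far
        self.seq = 0

    def submit(self, program_id, prompt):
        # Priority reflects the whole program's history, not this call alone.
        self.seq += 1
        heapq.heappush(self.queue,
                       Call(self.served[program_id], self.seq, program_id, prompt))

    def next_call(self):
        return heapq.heappop(self.queue) if self.queue else None

    def record_completion(self, call, service_units):
        self.served[call.program_id] += service_units

if __name__ == "__main__":
    sched = ProgramAwareScheduler()
    sched.submit("agent-A", "step 1")
    done = sched.next_call()
    sched.record_completion(done, service_units=500)  # agent-A already used 500 tokens
    sched.submit("agent-A", "step 2")
    sched.submit("agent-B", "step 1")                 # new program jumps ahead
    print(sched.next_call().program_id)               # -> "agent-B"
```

The point of the toy example is the program-level context: the second call from agent-A is deprioritized because of agent-A's earlier completed call, which a per-request scheduler would not see.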
SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification
Miao, Xupeng, Oliaro, Gabriele, Zhang, Zhihao, Cheng, Xinhao, Wang, Zeyu, Wong, Rae Ying Yee, Zhu, Alan, Yang, Lijie, Shi, Xiaoxiang, Shi, Chunan, Chen, Zhuoming, Arfeen, Daiyaan, Abhyankar, Reyna, Jia, Zhihao
The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them quickly and cheaply. This paper introduces SpecInfer, an LLM serving system that accelerates generative LLM inference with speculative inference and token tree verification. A key insight behind SpecInfer is to combine various collectively boost-tuned small language models to jointly predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement of serving generative LLMs while preserving model quality.
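A minimal sketch of token tree verification as described in the abstract, assuming a toy greedy "target model" and a sequential tree walk in place of the paper's batched tree-based parallel decoding; all names here are illustrative:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: verify a token tree of speculated continuations
# against a target model and keep the longest root-to-node path whose every
# token matches the target model's own next-token choice.

@dataclass
class Node:
    token: str
    children: list = field(default_factory=list)

def target_next_token(context):
    """Stand-in for the large LLM's greedy next-token prediction."""
    vocab = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
    return vocab.get(context[-1], "<eos>")

def verify_tree(prompt, root):
    """Return the longest speculated path accepted by the target model.
    A real system scores all tree nodes in one batched forward pass using a
    tree attention mask; here the tree is walked sequentially for clarity."""
    best = []

    def dfs(node, context, accepted):
        nonlocal best
        if len(accepted) > len(best):
            best = accepted[:]
        for child in node.children:
            if child.token == target_next_token(context):  # child agrees with the LLM
                dfs(child, context + [child.token], accepted + [child.token])

    dfs(root, list(prompt), [])
    return best

if __name__ == "__main__":
    # Token tree speculated by small models for the prompt "the cat":
    #   sat -> on -> the
    #   sat -> under
    #   ran
    root = Node("<root>", [
        Node("sat", [Node("on", [Node("the")]), Node("under")]),
        Node("ran"),
    ])
    print(verify_tree(["the", "cat"], root))   # -> ['sat', 'on', 'the']
```

Accepting three tokens from one verification pass is exactly where the speedup comes from: the expensive model validates a whole speculated branch instead of decoding one token at a time.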
SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction
Lin, Zhiqi, Miao, Youshan, Liu, Guodong, Shi, Xiaoxiang, Zhang, Quanlu, Yang, Fan, Maleki, Saeed, Zhu, Yi, Cao, Xu, Li, Cheng, Yang, Mao, Zhang, Lintao, Zhou, Lidong
With the growing model size, deep neural networks (DNN) are increasingly trained over massive GPU accelerators, which demands a proper parallelization plan that transforms a DNN model into fine-grained tasks and then schedules them to GPUs for execution. Due to the large search space, contemporary parallelization plan generators often rely on empirical rules that couple transformation and scheduling, and fall short in exploring more flexible schedules that yield better memory usage and compute efficiency. This tension can be exacerbated by emerging models with increasing complexity in structure and size. SuperScaler is a system that facilitates the design and generation of highly flexible parallelization plans. It explicitly formulates plan design and generation as three sequential phases: model transformation, space-time scheduling, and data dependency preserving. Such a principled approach decouples multiple seemingly intertwined factors and enables the composition of highly flexible parallelization plans. As a result, SuperScaler can not only generate empirical parallelization plans, but also construct new plans that achieve up to 3.5x speedup compared to state-of-the-art solutions like DeepSpeed, Megatron, and Alpa, for emerging DNN models like Swin-Transformer and AlphaFold2, as well as for well-optimized models like GPT-3.
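A minimal sketch of the three-phase decomposition named in the abstract (model transformation, space-time scheduling, data dependency preserving), on a toy two-operator model; the functions and data structures are hypothetical stand-ins, not SuperScaler's actual abstraction:

```python
from dataclasses import dataclass

# Illustrative sketch only: the three phases on a toy model with two operators.

@dataclass(frozen=True)
class Task:
    op: str        # which operator this fine-grained task belongs to
    shard: int     # which data shard it computes

# Phase 1: model transformation -- split each operator into fine-grained tasks.
def transform(ops, num_shards):
    return [Task(op, s) for op in ops for s in range(num_shards)]

# Phase 2: space-time scheduling -- map each task to a (device, time step) slot.
def schedule(tasks, ops, num_devices):
    placement = {}
    for t in tasks:
        stage = ops.index(t.op)                    # one time step per operator
        device = (t.shard + stage) % num_devices   # shifted placement across stages
        placement[t] = (device, stage)
    return placement

# Phase 3: data dependency preserving -- insert communication wherever a
# consumer reads a producer's output from a different device.
def add_comm(placement, deps):
    comm = []
    for producer, consumer in deps:
        if placement[producer][0] != placement[consumer][0]:
            comm.append((producer, consumer))      # would become a send/recv pair
    return comm

if __name__ == "__main__":
    ops = ["matmul", "gelu"]
    tasks = transform(ops, num_shards=2)
    plan = schedule(tasks, ops, num_devices=2)
    deps = [(Task("matmul", s), Task("gelu", s)) for s in range(2)]
    print(plan)                 # each task mapped to a (device, time) slot
    print(add_comm(plan, deps)) # cross-device edges that need communication
```

Keeping the phases separate is the point of the decoupling argument: the same transformed task set can be re-scheduled in space and time without re-deriving how data dependencies are preserved, which is what opens up plans beyond the usual empirical rules.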