AITopics | Wang, Stephanie

Plotting

Wang, Stephanie

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

Lin, Chien-Yu, Kamahori, Keisuke, Liu, Yiyu, Shi, Xiaoxiang, Kashyap, Madhav, Gu, Yile, Shao, Rulin, Ye, Zihao, Zhu, Kan, Wang, Stephanie, Krishnamurthy, Arvind, Kadekodi, Rohan, Ceze, Luis, Kasikci, Baris

arXiv.org Artificial IntelligenceFeb-28-2025

Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments, especially when limited GPU memory is available. To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that anticipates required data and transfers it from CPU to GPU in parallel with LLM generation. By leveraging the modularity of RAG pipelines, the inverted file index (IVF) search algorithm and similarities between queries, TeleRAG optimally overlaps data movement and computation. Experimental results show that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems, enabling faster, more memory-efficient deployments of advanced RAG applications.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.20969

Country:

Asia (0.28)
North America > United States (0.14)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution

Luan, Frank Sifei, Mao, Ziming, Wang, Ron Yifeng, Lin, Charlotte, Kamsetty, Amog, Chen, Hao, Su, Cheng, Veeramani, Balaji, Lee, Scott, Cho, SangBin, Zinzow, Clark, Liang, Eric, Stoica, Ion, Wang, Stephanie

arXiv.org Artificial IntelligenceFeb-16-2025

While ML model training and inference are both GPU-intensive, CPU-based data processing is often the bottleneck. Distributed data processing systems based on the batch or stream processing models assume homogeneous resource requirements. They excel at CPU-based computation but either under-utilize heterogeneous resources or impose high overheads on failure and reconfiguration. We introduce the streaming batch model, a hybrid of the two models that enables efficient and fault-tolerant heterogeneous execution. The key idea is to execute one partition at a time to allow lineage-based recovery with dynamic resource allocation. This enables memory-efficient pipelining across heterogeneous resources, similar to stream processing, but also offers the elasticity and fault tolerance properties of batch processing. We present Ray Data, an implementation of the streaming batch model that improves throughput on heterogeneous batch inference pipelines by 3--8$\times$ compared to traditional batch and stream processing systems. When training Stable Diffusion, Ray Data matches the throughput of single-node ML data loaders while additionally leveraging distributed heterogeneous clusters to further improve training throughput by 31%.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2501.12407

Country: North America > United States (0.46)

Genre: Research Report (0.40)

Industry:

Information Technology (0.55)
Energy > Oil & Gas (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Architecture (1.00)

Add feedback

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Ye, Zihao, Chen, Lequn, Lai, Ruihang, Lin, Wuwei, Zhang, Yineng, Wang, Stephanie, Chen, Tianqi, Kasikci, Baris, Grover, Vinod, Krishnamurthy, Arvind, Ceze, Luis

arXiv.org Artificial IntelligenceJan-1-2025

Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to dynamism of user requests while maintaining compatibility with CUDAGraph which requires static configuration. FlashInfer have been integrated into leading LLM serving frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieve 29-69% inter-token-latency reduction compared to compiler backends for LLM serving benchmark, 28-30% latency reduction for long-context inference, and 13-17% speedup for LLM serving with parallel generation.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2501.01005

Country:

North America > Canada (0.67)
North America > United States > California > Santa Clara County (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Composable Core-sets for Diversity Approximation on Multi-Dataset Streams

Wang, Stephanie, Flynn, Michael, Luo, Fangyu

arXiv.org Artificial IntelligenceAug-10-2023

Core-sets refer to subsets of data that maximize some function that is commonly a diversity or group requirement. These subsets are used in place of the original data to accomplish a given task with comparable or even enhanced performance if biases are removed. Composable core-sets are core-sets with the property that subsets of the core set can be unioned together to obtain an approximation for the original data; lending themselves to be used for streamed or distributed data. Recent work has focused on the use of core-sets for training machine learning models. Preceding solutions such as CRAIG have been proven to approximate gradient descent while providing a reduced training time. In this paper, we introduce a core-set construction algorithm for constructing composable core-sets to summarize streamed data for use in active learning environments. If combined with techniques such as CRAIG and heuristics to enhance construction speed, composable core-sets could be used for real time training of models when the amount of sensor data is large. We provide empirical analysis by considering extrapolated data for the runtime of such a brute force algorithm. This algorithm is then analyzed for efficiency through averaged empirical regression and key results and improvements are suggested for further research on the topic.

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2308.05878

Country:

North America > United States (0.16)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)

Add feedback

Learning to Play With Intrinsically-Motivated, Self-Aware Agents

Haber, Nick, Mrowca, Damian, Wang, Stephanie, Fei-Fei, Li F., Yamins, Daniel L.

Neural Information Processing SystemsDec-31-2018

Infants are experts at playing, with an amazing ability to generate novel structured behaviors in unstructured environments that lack clear extrinsic reward signals. We seek to mathematically formalize these abilities using a neural network that implements curiosity-driven intrinsic motivation. Using a simple but ecologically naturalistic simulated environment in which an agent can move and interact with objects it sees, we propose a "world-model" network that learns to predict the dynamic consequences of the agent's actions. Simultaneously, we train a separate explicit "self-model" that allows the agent to track the error map of its world-model. It then uses the self-model to adversarially challenge the developing world-model. We demonstrate that this policy causes the agent to explore novel and informative interactions with its environment, leading to the generation of a spectrum of complex behaviors, including ego-motion prediction, object attention, and object gathering. Moreover, the world-model that the agent learns supports improved performance on object dynamics prediction, detection, localization and recognition tasks. Taken together, our results are initial steps toward creating flexible autonomous agents that self-supervise in realistic physical environments.

agent, health & medicine, neural network, (20 more...)

Neural Information Processing Systems

Country: North America > United States > California > Santa Clara County (0.14)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
(2 more...)

Add feedback

Learning to Play With Intrinsically-Motivated, Self-Aware Agents

Haber, Nick, Mrowca, Damian, Wang, Stephanie, Fei-Fei, Li F., Yamins, Daniel L.

Neural Information Processing SystemsDec-31-2018

agent, health & medicine, neural network, (20 more...)

Neural Information Processing Systems

Country: North America > United States > California > Santa Clara County (0.14)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Ray: A Distributed Framework for Emerging AI Applications

Moritz, Philipp, Nishihara, Robert, Wang, Stephanie, Tumanov, Alexey, Liaw, Richard, Liang, Eric, Paul, William, Jordan, Michael I., Stoica, Ion

arXiv.org Machine LearningDec-15-2017

The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray---a distributed system to address them. Ray implements a dynamic task graph computation model that supports both the task-parallel and the actor programming models. To meet the performance requirements of AI applications, we propose an architecture that logically centralizes the system's control state using a sharded storage system and a novel bottom-up distributed scheduler. In our experiments, we demonstrate sub-millisecond remote task latencies and linear throughput scaling beyond 1.8 million tasks per second. We empirically validate that Ray speeds up challenging benchmarks and serves as both a natural and performant fit for an emerging class of reinforcement learning applications and algorithms.

deep learning, neural network, scheduler, (22 more...)

arXiv.org Machine Learning

1712.05889

Genre: Research Report (0.84)

Industry:

Leisure & Entertainment > Games (0.46)
Information Technology (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback