Chen, Liangfu
Inference Optimization of Foundation Models on AI Accelerators
Park, Youngsuk, Budhathoki, Kailash, Chen, Liangfu, Kübler, Jonas, Huang, Jiaji, Kleindessner, Matthäus, Huan, Jun, Cevher, Volkan, Wang, Yida, Karypis, George
Powerful foundation models, including large language models (LLMs) with Transformer architectures, have ushered in a new era of Generative AI across various industries. Industry and the research community have witnessed a large number of new applications based on these foundation models, including question answering, customer service, image and video generation, and code completion, among others. However, as the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios. As a result, the demand for cost-effective and fast inference using AI accelerators is higher than ever. To this end, our tutorial offers a comprehensive discussion of complementary inference optimization techniques using AI accelerators. Beginning with an overview of basic Transformer architectures and deep learning system frameworks, we dive deep into system optimization techniques for fast and memory-efficient attention computation and discuss how they can be implemented efficiently on AI accelerators. Next, we describe architectural elements that are key for fast Transformer inference. Finally, we examine various model compression and fast decoding strategies in the same context.
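To make the attention-level memory bottleneck targeted by these optimizations concrete, below is a minimal single-head, single-sequence NumPy sketch of KV-cache-based incremental decoding, the memory-bound access pattern that efficient attention kernels on AI accelerators are built around. The function name `decode_step`, the `kv_cache` dictionary layout, and the tensor shapes are illustrative assumptions, not part of the tutorial material.

```python
import numpy as np

def decode_step(q_new, k_new, v_new, kv_cache):
    """One incremental decoding step with a KV cache (single head, single sequence).

    q_new, k_new, v_new: (1, d) projections of the newly generated token.
    kv_cache: dict with previously computed keys/values, each of shape (t, d).
    """
    # Append the new key/value instead of recomputing the whole sequence.
    kv_cache["k"] = np.concatenate([kv_cache["k"], k_new], axis=0)
    kv_cache["v"] = np.concatenate([kv_cache["v"], v_new], axis=0)

    # Attention of the new query over all cached positions; the dominant cost
    # here is reading the cache from memory, not the arithmetic itself.
    scale = 1.0 / np.sqrt(q_new.shape[-1])
    scores = (q_new @ kv_cache["k"].T) * scale       # (1, t+1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ kv_cache["v"]                   # (1, d)
```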
Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs
Athiwaratkun, Ben, Gonugondla, Sujan Kumar, Gouda, Sanjay Krishna, Qian, Haifeng, Ding, Hantian, Sun, Qing, Wang, Jun, Guo, Jiacheng, Chen, Liangfu, Bhatia, Parminder, Nallapati, Ramesh, Sengupta, Sudipta, Xiang, Bing
This study introduces bifurcated attention, a method designed to enhance language model inference in shared-context batch decoding scenarios. Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency at high batch sizes and extended context lengths. Bifurcated attention achieves this by strategically dividing the attention mechanism during incremental decoding into two separate GEMM operations: one focusing on the KV cache from prefill, and another on the decoding process itself. While maintaining the computational load (FLOPs) of standard attention mechanisms, bifurcated attention ensures precise computation with significantly reduced memory IO. Our empirical results show over 2.1$\times$ speedup when sampling 16 output sequences and more than 6.2$\times$ speedup when sampling 32 sequences at context lengths exceeding 8k tokens on a 7B model that uses multi-head attention. The efficiency gains from bifurcated attention translate into lower latency, making it particularly suitable for real-time applications. For instance, it enables massively parallel answer generation without substantially increasing latency, thus enhancing performance when integrated with post-processing techniques such as re-ranking.
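The split into two GEMMs can be illustrated with a minimal single-head NumPy sketch: each sampled sequence's query attends once to the shared prefill KV cache (stored a single time rather than replicated per sequence) and once to its own decoded KV cache, and the two partial score sets are combined under a joint softmax. The function name, tensor shapes, and the concatenate-then-normalize combination step here are our assumptions rather than the paper's exact formulation.

```python
import numpy as np

def bifurcated_attention(q, k_prefix, v_prefix, k_dec, v_dec):
    """Single-head sketch of shared-prefix attention for incremental decoding.

    q:        (n, d)        one query per sampled sequence (batch of n)
    k_prefix: (L_ctx, d)    KV cache of the shared prompt, stored once
    v_prefix: (L_ctx, d)
    k_dec:    (n, L_dec, d) per-sequence KV cache of already decoded tokens
    v_dec:    (n, L_dec, d)
    """
    scale = 1.0 / np.sqrt(q.shape[-1])

    # GEMM 1: all queries against the single shared prefix cache.
    s_prefix = (q @ k_prefix.T) * scale                 # (n, L_ctx)

    # GEMM 2: each query against its own decoded cache.
    s_dec = np.einsum("nd,nld->nl", q, k_dec) * scale   # (n, L_dec)

    # Joint softmax over prefix + decoded positions, then weighted sums.
    scores = np.concatenate([s_prefix, s_dec], axis=-1)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    w_prefix, w_dec = w[:, : s_prefix.shape[1]], w[:, s_prefix.shape[1]:]

    out = w_prefix @ v_prefix + np.einsum("nl,nld->nd", w_dec, v_dec)
    return out                                          # (n, d)
```

The FLOP count matches attending over the concatenated cache, but the prefix keys and values are read from memory once for the whole batch instead of once per sampled sequence, which is where the reduction in memory IO comes from.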