AITopics | Ruhle, Victor

Collaborating Authors

Ruhle, Victor

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

TurboAttention: Efficient Attention Approximation For High Throughputs LLMs

Kang, Hao, Bharadwaj, Srikant, Hensman, James, Krishna, Tushar, Ruhle, Victor, Rajmohan, Saravan

arXiv.org Artificial IntelligenceDec-17-2024

While techniques, such as quantization and acceleration algorithms, like FlashAttention, have improved efficiency of the overall inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent Key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for attention operation. We present TurboAttention, a comprehensive approach to enable quantized execution of attention that simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during exponentiation operation in attention. Experimental results demonstrate that TurboAttention achieves 1.2-1.8x Large language models (LLMs) (Touvron et al., 2023; Gunasekar et al., 2023; Brown et al., 2020) have excelled The bottlenecks during LLM inference can be split into in tasks like natural language understanding (Joshi et al., three major sections: the linear projection operations (QKV 2017; Dodge et al., 2021) and generative text production projection and FFN), the memory-intensive Key/Value (KV) (Hendrycks et al., 2021; Zhong et al., 2017).

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.08585

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing

Ding, Dujian, Mallick, Ankur, Wang, Chi, Sim, Robert, Mukherjee, Subhabrata, Ruhle, Victor, Lakshmanan, Laks V. S., Awadallah, Ahmed Hassan

arXiv.org Artificial IntelligenceApr-22-2024

Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower cost (e.g., edge) devices, tend to lag behind in terms of response quality. Therefore in this work we propose a hybrid inference approach which combines their respective strengths to save cost and maintain quality. Our approach uses a router that assigns queries to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost as per the scenario requirements. In experiments our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2404.14618

Country: North America > United States > California (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Hybrid Retrieval-Augmented Generation for Real-time Composition Assistance

Zhang, Xuchao, Xia, Menglin, Couturier, Camille, Zheng, Guoqing, Rajmohan, Saravan, Ruhle, Victor

arXiv.org Artificial IntelligenceAug-8-2023

Retrieval augmented models show promise in enhancing traditional language models by improving their contextual understanding, integrating private data, and reducing hallucination. However, the processing time required for retrieval augmented large language models poses a challenge when applying them to tasks that require real-time responses, such as composition assistance. To overcome this limitation, we propose the Hybrid Retrieval-Augmented Generation (HybridRAG) framework that leverages a hybrid setting that combines both client and cloud models. HybridRAG incorporates retrieval-augmented memory generated asynchronously by a Large Language Model (LLM) in the cloud. By integrating this retrieval augmented memory, the client model acquires the capability to generate highly effective responses, benefiting from the LLM's capabilities. Furthermore, through asynchronous memory integration, the client model is capable of delivering real-time responses to user requests without the need to wait for memory synchronization from the cloud. Our experiments on Wikitext and Pile subsets show that HybridRAG achieves lower latency than a cloud-based retrieval-augmented LLM, while outperforming client-only models in utility.

machine learning, natural language, real time system, (19 more...)

arXiv.org Artificial Intelligence

2308.04215

Country:

North America > United States (0.46)
Europe > United Kingdom > England > Leicestershire > Leicester (0.14)

Genre: Research Report (0.64)

Industry:

Information Technology (1.00)
Leisure & Entertainment > Games > Computer Games (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback