Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO

Barad, Haim; Aidova, Ekaterina; Gorbachev, Yury

Nov-8-2023–arXiv.org Artificial Intelligence

Inference optimizations are critical for improving user experience and for reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling, which reduces the overall latency of text generation, and compare it with standard autoregressive sampling. Speculative sampling can be combined with model-based optimizations (e.g., quantization) to provide a further optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.
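To make the comparison in the abstract concrete, here is a minimal sketch of the greedy (argmax) variant of speculative decoding next to plain autoregressive decoding. It is not the paper's notebook and does not use OpenVINO. The two "models" (`target_next`, `draft_next`) are hypothetical toy functions, the lookahead `k` is an assumed parameter, and the probabilistic accept/reject rule of full speculative sampling and the KV cache are both omitted for brevity.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target_next(tokens: List[int]) -> int:
    # Hypothetical stand-in for the large, slow target model.
    return (sum(tokens) * 31 + 7) % 50


def draft_next(tokens: List[int]) -> int:
    # Hypothetical stand-in for the small, fast draft model:
    # it agrees with the target except at every fourth prefix length.
    if len(tokens) % 4 == 0:
        return (target_next(tokens) + 1) % 50
    return target_next(tokens)


def autoregressive(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one model call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative(target: Model, draft: Model, prompt: List[int],
                n_new: int, k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position. A real model would
        #    score all k + 1 positions in a single batched forward pass,
        #    so this loop is counted as one target pass.
        target_passes += 1
        checks = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == checks[n_ok]:
            n_ok += 1
        tokens.extend(proposal[:n_ok])
        # 4. Append the target's own token: a correction at the first
        #    mismatch, or a bonus token if everything was accepted.
        tokens.append(checks[n_ok])
    return tokens[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = autoregressive(target_next, prompt, 20)
    fast, passes = speculative(target_next, draft_next, prompt, 20)
    print("identical output:", fast == baseline)
    print("target passes:", passes, "vs baseline calls:", 20)
```

In this greedy variant the output matches what the target model alone would produce; any latency gain comes from needing fewer target passes whenever the draft model guesses correctly. In a real deployment both models would also reuse their KV caches across steps rather than recomputing the full prefix as this toy does.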

generative ai, leveraging speculative sampling, openvino

Genre:
- Research Report (0.40)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Generation (0.40)
  - Machine Learning > Neural Networks
    - Deep Learning > Generative AI (0.40)
