SpeedLimit: Neural Architecture Search for Quantized Transformer Models

Yuji Chai, Luke Bailey, Yunho Jin, Matthew Karle, Glenn G. Ko, David Brooks, Gu-Yeon Wei, H. T. Kung

arXiv.org (Artificial Intelligence)

While research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry often necessitate a rigorous consideration of inference latency constraints. Addressing this challenge, we introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint. Our method incorporates 8-bit integer quantization in the search process to outperform the current state-of-the-art technique. Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments.
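The abstract describes a latency-constrained NAS loop in which candidate architectures are evaluated after 8-bit integer quantization. The sketch below illustrates that general idea only; it is not the paper's method. The names `candidate_configs`, `build_model`, and `evaluate_accuracy` are hypothetical placeholders, and PyTorch's post-training dynamic quantization stands in for whatever quantization scheme SpeedLimit actually uses.

```python
# Illustrative sketch (not the authors' code): pick the most accurate
# candidate whose int8-quantized latency stays under an upper bound.
import time

import torch
import torch.nn as nn


def measure_latency(model, example_input, runs=20):
    """Median wall-clock latency of one forward pass, in milliseconds."""
    model.eval()
    times = []
    with torch.no_grad():
        for _ in range(runs):
            start = time.perf_counter()
            model(example_input)
            times.append((time.perf_counter() - start) * 1e3)
    return sorted(times)[len(times) // 2]


def search(candidate_configs, build_model, evaluate_accuracy,
           example_input, latency_budget_ms):
    """Latency-constrained search over quantized candidates (sketch)."""
    best_acc, best_model = -1.0, None
    for config in candidate_configs:
        model = build_model(config)  # hypothetical per-architecture constructor
        # Post-training dynamic quantization of linear layers to int8, so the
        # latency constraint is checked on the quantized model itself.
        qmodel = torch.ao.quantization.quantize_dynamic(
            model, {nn.Linear}, dtype=torch.qint8)
        if measure_latency(qmodel, example_input) > latency_budget_ms:
            continue  # violates the upper-bound latency constraint
        acc = evaluate_accuracy(qmodel)  # hypothetical task-specific metric
        if acc > best_acc:
            best_acc, best_model = acc, qmodel
    return best_model, best_acc
```

One design point the abstract implies: latency is measured on the quantized model rather than the float model, since int8 kernels can change which candidate architectures fit under the budget.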
