Optimizing Inference Performance of Transformers on CPUs

Feb-17-2021–arXiv.org Artificial Intelligence

The Transformer architecture revolutionized the field of natural language processing (NLP). Transformer-based models (e.g., BERT) power many important Web services, such as search, translation, and question-answering. While enormous research attention is paid to the training of those models, relatively little effort is made to improve their inference performance. This paper addresses this gap by presenting an empirical analysis of the scalability and performance of inferencing Transformer-based models on CPUs. We identify the key component of the Transformer architecture where the bulk of the computation happens, namely the matrix multiplication (matmul) operations, and propose three optimizations to speed them up. The first optimization is based on the observation that the performance of the matmul operation is heavily impacted not only by the shape (dimensions) of the source matrices and the available computing resources (the number of worker threads), but also by ...
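To make the point about matmul shapes concrete, the sketch below lists the matrix multiplications in a single Transformer encoder layer and counts their floating-point operations. It is an illustration only, not the paper's method: the default dimensions (hidden size 768, 12 attention heads, feed-forward size 3072) are the commonly cited BERT-base values and are an assumption here, as are the `MatMul` and `encoder_layer_matmuls` names. The count covers one sequence (batch size 1) and ignores biases, softmax, and layer normalization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatMul:
    """One (m x k) @ (k x n) product, repeated `count` times per layer."""
    name: str
    m: int          # rows of the left matrix
    k: int          # shared inner dimension
    n: int          # columns of the right matrix
    count: int = 1  # how many times this shape occurs in the layer

    @property
    def flops(self) -> int:
        # One multiply and one add for each inner-product term.
        return 2 * self.m * self.k * self.n * self.count


def encoder_layer_matmuls(seq_len, hidden=768, heads=12, ffn=3072):
    """Matmuls of one encoder layer (assumed BERT-base-like dimensions)."""
    head_dim = hidden // heads
    return [
        MatMul("Q/K/V projections", seq_len, hidden, hidden, count=3),
        MatMul("attention scores (Q x K^T)", seq_len, head_dim, seq_len, count=heads),
        MatMul("attention context (scores x V)", seq_len, seq_len, head_dim, count=heads),
        MatMul("attention output projection", seq_len, hidden, hidden),
        MatMul("feed-forward expand", seq_len, hidden, ffn),
        MatMul("feed-forward contract", seq_len, ffn, hidden),
    ]


if __name__ == "__main__":
    ops = encoder_layer_matmuls(seq_len=128)
    total = sum(op.flops for op in ops)
    for op in ops:
        share = 100.0 * op.flops / total
        print(f"{op.name:32s} {op.m:4d}x{op.k:4d} @ {op.k:4d}x{op.n:4d} "
              f"x{op.count:<2d} {share:5.1f}% of matmul FLOPs")
```

Even within one layer the source matrices range from tall-and-thin to nearly square, which is why a single matmul configuration is unlikely to suit every call. Measuring the actual effect of shape and thread count requires timing a real matmul backend, which this sketch deliberately leaves out.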


Country:
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States
  - Massachusetts > Middlesex County > Burlington (0.04)

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
