STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining

Guo, Liwei, Choe, Wonkyo, Lin, Felix Xiaozhu

Jan-31-2023–arXiv.org Artificial Intelligence

Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable because it preserves user data privacy and avoids network roundtrips. Yet the unprecedented size of an NLP model stresses both latency and memory, creating a tension between these two key resources of a mobile device. To meet a target latency, holding the whole model in memory launches execution as soon as possible but increases an app's memory footprint severalfold, limiting its benefit to only a few inferences before the memory is reclaimed by mobile memory management. On the other hand, loading the model from storage on demand incurs IO delays as long as a few seconds, far exceeding the delay a user finds acceptable; pipelining layerwise model loading and execution does not hide the IO either, because of the large skew between IO and computation delays. To this end, we propose Speedy Transformer Inference (STI). Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, STI reconciles the latency vs. memory tension via two novel techniques. First, model sharding: STI manages model parameters as independently tunable shards and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer: STI instantiates an IO/compute pipeline and uses a small buffer of preloaded shards to bootstrap execution without stalling at early stages; it judiciously selects, tunes, and assembles shards per their importance for resource-elastic execution, maximizing inference accuracy. Atop two commodity SoCs, we build STI and evaluate it on a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that STI delivers high accuracies with 1-2 orders of magnitude lower memory, outperforming competitive baselines.
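The abstract's argument about pipelining and the preload buffer can be illustrated with a toy latency model. The sketch below is not STI's planner: the function name `pipeline_latency`, the two-stage model (one sequential IO stream, one sequential compute stream), and the per-layer delays of 40 ms IO and 10 ms compute are all assumptions chosen only to show the effect of IO/compute skew.

```python
def pipeline_latency(io_ms, compute_ms, preloaded=0):
    """Finish time (ms) of a simple two-stage IO/compute pipeline.

    Hypothetical model, not STI's implementation:
    - layers with index < preloaded are already in memory (no IO needed);
    - the remaining layers are loaded one after another, starting at t = 0;
    - layer i starts computing once it is loaded and layer i-1 has finished.
    """
    io_done = 0.0       # time at which the most recent load completed
    compute_done = 0.0  # time at which the most recent layer finished computing
    for i, (io, comp) in enumerate(zip(io_ms, compute_ms)):
        if i >= preloaded:
            io_done += io
            ready = io_done
        else:
            ready = 0.0  # served from the preload buffer
        compute_done = max(compute_done, ready) + comp
    return compute_done


# Assumed, illustrative numbers: 12 layers, IO 4x slower than compute.
io = [40] * 12
comp = [10] * 12

on_demand = pipeline_latency(io, comp)               # plain layerwise pipelining
buffered = pipeline_latency(io, comp, preloaded=3)   # small preload buffer
in_memory = pipeline_latency(io, comp, preloaded=12) # whole model resident
print(on_demand, buffered, in_memory)
```

Under these assumed numbers, plain pipelining is bound by the IO stream, so overlapping compute with loading barely helps. Preloading a few layers lets compute start immediately and shortens the IO stream, but the remaining loads still dominate, which is consistent with the abstract's point that the pipeline must also select and tune which shards to load rather than rely on overlap alone.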

artificial intelligence, machine learning, natural language

Country:
- North America
  - United States
    - Virginia (0.05)
    - District of Columbia > Washington (0.04)
    - Massachusetts (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Florida > Orange County
      - Orlando (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Utah > Salt Lake County
      - Salt Lake City (0.04)
    - Colorado > Adams County
      - Westminster (0.04)
    - California
      - Los Angeles County > Long Beach (0.04)
      - Santa Clara County > Stanford (0.04)
    - New York > New York County
      - New York City (0.04)
  - Puerto Rico > San Juan
    - San Juan (0.04)
  - Canada
    - Ontario > Toronto (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.05)
- Europe
  - United Kingdom (0.04)
  - Denmark (0.04)
  - Belgium (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Greece > Attica
    - Athens (0.04)
  - Germany > Saxony
    - Dresden (0.04)
  - France > Brittany
    - Ille-et-Vilaine > Rennes (0.04)
- Asia
  - China > Hong Kong (0.04)
  - South Korea > Seoul
    - Seoul (0.04)
  - Japan > Honshū
    - Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
  - India > NCT
    - New Delhi (0.04)
    - Delhi (0.04)
- Africa
  - Mali (0.04)
  - Ethiopia > Addis Ababa
    - Addis Ababa (0.04)

Genre:
- Research Report (1.00)

Industry:
- Information Technology > Security & Privacy (0.88)

Technology:
- Information Technology
  - Hardware (1.00)
  - Communications (1.00)
  - Artificial Intelligence
    - Natural Language (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)
