Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving

Yinwei Dai, Rui Pan, Anand Iyer, Kai Li, Ravi Netravali

arXiv.org Artificial Intelligence 

Ramp predictions with sufficiently high confidence (subject to a threshold) exit the model, foregoing downstream layers and bringing corresponding savings in both compute and latency. The intuition is that models are often overparameterized (especially given recent model growth [31, 32, 48]), and certain 'easy' inputs may not require complete model processing for accurate results. Importantly, unlike existing platform knobs (e.g., batch size) that simply walk the steep latency-throughput tradeoff curve, EEs rethink the granularity of inference on a per-input basis. This, in turn, provides a path towards lowering request latencies without harming platform throughputs.

existing platform knobs (e.g., batch sizes) fail to ease this fundamental tension, and instead only enable users to harshly trade off one property for the other. This paper explores an alternate strategy to taming throughput-latency tradeoffs by changing the granularity at which inference is performed. We present Apparate, a system that automatically applies and manages early exits (EEs) in ML models, whereby certain inputs can exit with results at intermediate layers. To cope with the time-varying overhead and accuracy challenges that EEs bring, Apparate repurposes exits to provide continual feedback that powers several novel runtime monitoring and
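The ramp-based exit rule above (return a ramp's prediction as soon as its confidence clears a threshold, skipping all downstream layers) can be sketched in a few lines. The `EEModel` class and the layer/ramp callables below are hypothetical illustrations of the general EE mechanism, not Apparate's actual interfaces:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical types: a layer transforms the hidden state; a ramp maps a
# hidden state to a (prediction, confidence) pair.
Layer = Callable[[list], list]
Ramp = Callable[[list], Tuple[str, float]]

@dataclass
class EEModel:
    layers: List[Layer]
    # ramps[i] runs after layers[i]; None means no exit ramp at that layer.
    ramps: List[Optional[Ramp]]

    def infer(self, x: list, threshold: float) -> Tuple[str, int]:
        """Run layers in order; exit at the first ramp whose confidence
        meets the threshold, foregoing all downstream layers.
        Returns (prediction, number of layers actually executed)."""
        h = x
        for i, layer in enumerate(self.layers):
            h = layer(h)
            ramp = self.ramps[i]
            if ramp is not None:
                pred, conf = ramp(h)
                if conf >= threshold:
                    return pred, i + 1  # early exit: savings in compute/latency
        # No ramp fired: fall through to the full-model output
        # (this sketch assumes a ramp is attached to the final layer).
        pred, _ = self.ramps[-1](h)
        return pred, len(self.layers)
```

Note how this changes the granularity of inference per input: an 'easy' input triggers an intermediate ramp and skips the rest of the model, while a 'hard' input falls through to the final layer, so full-model accuracy is the floor rather than the only option.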