Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models

Neural Information Processing Systems

Large language models (LLMs) are highly capable but also computationally expensive.