Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Bae, Sangmin, Ko, Jongwoo, Song, Hwanjun, Yun, Se-Young
arXiv.org Artificial Intelligence
To tackle the high inference latency of autoregressive language models, previous studies have proposed an early-exiting framework that allocates an adaptive computation path to each token based on the complexity of generating the subsequent token. However, we observe several shortcomings, including performance degradation caused by the state copying mechanism or by numerous exit paths, and sensitivity to exit confidence thresholds. Consequently, we propose a Fast and Robust Early-Exiting (FREE) framework, which incorporates a shallow-deep module and synchronized parallel decoding. Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens. Furthermore, as parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator that exploits a Beta mixture model to determine suitable confidence thresholds. We empirically demonstrate the superiority of our proposed framework on extensive generation tasks.
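The control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the maximum softmax probability as the confidence score, and the toy logits are all assumptions made for illustration. The sketch only tracks which positions would exit at the shallow model and which positions a single deep pass would cover once a token fails the confidence check.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_with_early_exit(shallow_logits_per_step, threshold):
    """Toy trace of shallow-deep early exiting (hypothetical helper).

    For each position, returns ("shallow", []) if the shallow model's
    confidence meets the threshold, otherwise ("deep", positions), where
    positions are the stacked early-exited tokens plus the current token
    that one deep pass would process together.
    """
    pending = []  # early-exited positions not yet processed by the deep model
    trace = []
    for pos, logits in enumerate(shallow_logits_per_step):
        # Assumption: confidence is the top softmax probability.
        confidence = max(softmax(logits))
        if confidence >= threshold:
            pending.append(pos)
            trace.append(("shallow", []))
        else:
            # Deep pass covers the stacked tokens and the current one at once.
            trace.append(("deep", pending + [pos]))
            pending = []
    return trace


# Made-up shallow-model logits for five decoding steps.
steps = [
    [5.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [4.0, 0.0, 0.0],
    [6.0, 0.0, 0.0],
    [0.0, 0.0, 0.1],
]
trace = decode_with_early_exit(steps, threshold=0.9)
```

With these toy inputs, positions 0, 2, and 3 exit at the shallow model, while positions 1 and 4 trigger deep passes that also cover the tokens stacked before them. A fixed threshold is used here for simplicity; the abstract describes estimating it adaptively with a Beta mixture model, which this sketch does not attempt.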
Oct-9-2023
- Genre:
- Research Report (1.00)