PALBERT: Teaching ALBERT to Ponder

Oct-11-2024, 05:01:45 GMT–Neural Information Processing Systems

Currently, pre-trained models can be considered the default choice for a wide range of NLP tasks. Despite their SoTA results, there is practical evidence that these models may require a different number of computing layers for different input sequences, since evaluating all layers leads to overconfidence in wrong predictions (namely overthinking). This problem can potentially be solved by implementing adaptive computation time approaches, which were first designed to improve inference speed. Recently proposed PonderNet may be a promising solution for performing an early exit by treating the exit layer's index as a latent variable. However, the originally proposed exit criterion, relying on sampling from trained posterior distribution on the probability of exiting from the i -th layer, introduces major variance in exit layer indices, significantly reducing the resulting model's performance. In this paper, we propose improving PonderNet with a novel deterministic Q-exit criterion and a revisited model architecture.

palbert, ponder, teaching albert, (4 more...)

Neural Information Processing Systems

Oct-11-2024, 05:01:45 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence (0.80)