Differentiable Subset Pruning of Transformer Heads

Li, Jiaoda, Cotterell, Ryan, Sachan, Mrinmaya

Jul-27-2023–arXiv.org Artificial Intelligence

Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.

machine learning, natural language, pruning, (18 more...)

arXiv.org Artificial Intelligence

Jul-27-2023

arXiv.org PDF

Add feedback

Country:
- North America
  - United States
    - New York > New York County
      - New York City (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
- Europe
  - Czechia > Prague (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - Switzerland > Zürich
    - Zürich (0.04)
  - Italy > Tuscany
    - Florence (0.04)
- Asia
  - South Korea (0.04)
  - Vietnam > Hanoi
    - Hanoi (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Machine Translation (0.67)
  - Machine Learning
    - Neural Networks > Deep Learning (1.00)
    - Statistical Learning (0.68)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found