Structural Knowledge Distillation

Wang, Xinyu, Jiang, Yong, Yan, Zhaohui, Jia, Zixia, Bach, Nguyen, Wang, Tao, Huang, Zhongqiang, Huang, Fei, Tu, Kewei

Oct-10-2020–arXiv.org Artificial Intelligence

Knowledge distillation is a critical technique for transferring knowledge between models, typically from a large model (the teacher) to a smaller one (the student). The objective function of knowledge distillation is typically the cross-entropy between the teacher's and the student's output distributions. However, for structured prediction problems, the output space is exponential in size; therefore, the cross-entropy objective becomes intractable to compute and optimize directly. In this paper, we derive a factorized form of the knowledge distillation objective for structured prediction, which is tractable for many typical choices of the teacher and student models. In particular, we show the tractability and empirical effectiveness of structural knowledge distillation between sequence labeling and dependency parsing models under four different scenarios: 1) the teacher and student share the same factorization form of the output structure scoring function; 2) the student factorization produces smaller substructures than the teacher factorization; 3) the teacher factorization produces smaller substructures than the student factorization; 4) the factorization forms from the teacher and the student are incompatible.

Deeper and larger neural networks have led to significant improvements in accuracy on various tasks, but they are also more computationally expensive and unfit for resource-constrained scenarios such as online serving. An interesting and viable solution to this problem is knowledge distillation (KD) (Buciluǎ et al., 2006; Ba & Caruana, 2014; Hinton et al., 2015), which can be used to transfer the knowledge of a large model (the teacher) to a smaller model (the student). In the field of natural language processing, for example, KD has been successfully applied to compress massive pretrained language models such as BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) into much smaller and faster models without significant loss in accuracy (Tang et al., 2019; Sanh et al., 2019; Tsai et al., 2019; Mukherjee & Hassan Awadallah, 2020).
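To make the tractability claim in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy linear-chain teacher and a student that scores each position independently; all names and numbers (`emit`, `trans`, `student_logits`, `kd_brute_force`, `kd_factorized`) are hypothetical and chosen for illustration. It checks numerically that the cross-entropy summed over every label sequence equals a sum over positions that uses only the teacher's per-position marginals.

```python
import itertools
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def teacher_distribution(emit, trans):
    """Distribution of a toy linear-chain teacher over ALL label sequences.

    Brute-force enumeration: feasible only because the example is tiny.
    """
    n, k = len(emit), len(emit[0])
    seqs = list(itertools.product(range(k), repeat=n))
    scores = []
    for y in seqs:
        s = sum(emit[i][y[i]] for i in range(n))
        s += sum(trans[y[i - 1]][y[i]] for i in range(1, n))
        scores.append(s)
    return dict(zip(seqs, softmax(scores)))


def kd_brute_force(p_teacher, p_student):
    """Cross-entropy summed over the full (exponentially large) output space."""
    loss = 0.0
    for y, pt in p_teacher.items():
        log_ps = sum(math.log(p_student[i][label]) for i, label in enumerate(y))
        loss -= pt * log_ps
    return loss


def kd_factorized(p_teacher, p_student):
    """Same quantity, computed from the teacher's per-position marginals.

    The marginals are accumulated by enumeration here for simplicity; for a
    linear-chain model they could instead come from forward-backward.
    """
    n, k = len(p_student), len(p_student[0])
    marg = [[0.0] * k for _ in range(n)]
    for y, pt in p_teacher.items():
        for i, label in enumerate(y):
            marg[i][label] += pt
    return -sum(
        marg[i][l] * math.log(p_student[i][l]) for i in range(n) for l in range(k)
    )


# Hypothetical toy scores: 3 positions, 2 labels.
emit = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]
trans = [[0.6, -0.4], [-0.2, 0.9]]
student_logits = [[0.4, 0.1], [0.0, 0.7], [0.2, -0.3]]

p_teacher = teacher_distribution(emit, trans)
p_student = [softmax(row) for row in student_logits]

full = kd_brute_force(p_teacher, p_student)
fact = kd_factorized(p_teacher, p_student)
print(f"brute force: {full:.6f}  factorized: {fact:.6f}")
```

The two values agree because the student's log-probability of a sequence is a sum of per-position terms, so the expectation under the teacher needs only the teacher's marginal probability of each term. This corresponds loosely to the abstract's second scenario, in which the student's substructures (single labels) are smaller than the teacher's (label pairs).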


