Knowledge Distillation of Domain-adapted LLMs for Question-Answering in Telecom

Sen, Rishika, Roychowdhury, Sujoy, Soman, Sumit, Ranjani, H. G., Mohanty, Srikhetra

arXiv.org Artificial Intelligence 

Figure 1 shows the heatmap depicting the performance of 16 combinations of KD for 14 metrics. For brevity, we also report the mean of all 14 metrics and the group-wise means (N-gram metrics, embedding-based metrics, and Oracle-LLM metrics) in Figure 1. We systematically analyze the results and organize our findings as the impact of (i) SFT (RQ1), (ii) SFT on teacher and student (RQ1), (iii) vocabulary and KD algorithm (RQ2), and (iv) performance metric groups (RQ3).

3.1 Impact of SFT

We organize the analysis with vocabulary as the starting point.

3.1.1 Llama

Consider the bar plots in Figure 1 that depict Llama as the teacher, i.e., the bars denoting (Llama, Vanilla KD) and (Llama, DSKD). We observe that SFT of the teacher, the student, or both improves performance irrespective of the training algorithm (first bar vs. the subsequent 3 bars). The improvement is statistically significant (refer to H^S_train, H^T_train, and H^{T,S}_train in Table 3). Here, we observe that the null hypothesis (NH) is rejected for most metrics (13 out of 14 for Vanilla KD and 8 or 9 out of 14 for DSKD) with SFT of the student, the teacher, or both for the Llama vocabulary, irrespective of the algorithm.
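For readers unfamiliar with the Vanilla KD baseline compared above, the core objective is a KL divergence between the teacher's and student's temperature-softened output distributions. The sketch below is a minimal, self-contained illustration of that loss; the logits are hypothetical and do not come from the paper's actual teacher/student models.

```python
# Sketch of the vanilla KD objective: the student matches the teacher's
# softened distribution via KL divergence. Logits here are hypothetical.
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def vanilla_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the standard (Hinton et al.) formulation."""
    p = softmax(teacher_logits, temperature)  # teacher (target)
    q = softmax(student_logits, temperature)  # student
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Identical logits give zero loss; diverging logits give a positive loss.
print(vanilla_kd_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))      # 0.0
print(vanilla_kd_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]) > 0)  # True
```

In training, this loss is computed per output token position over the vocabulary and averaged; DSKD differs in how it aligns the two models' (possibly mismatched) vocabulary spaces before comparing distributions.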

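The per-metric rejection counts above come from hypothesis tests comparing runs with and without SFT. The exact test statistic is not restated in this excerpt, so the following is only an illustrative sketch using a simple two-sided paired sign test on hypothetical per-question metric scores.

```python
# Hedged illustration of per-metric significance testing: a two-sided
# paired sign test (ties dropped). The paper's actual test is not
# restated in this excerpt; the scores below are hypothetical.
from math import comb

def sign_test_p(baseline, treatment):
    """Two-sided sign test p-value for paired samples."""
    wins = sum(t > b for b, t in zip(baseline, treatment))
    losses = sum(t < b for b, t in zip(baseline, treatment))
    n = wins + losses          # ties contribute nothing
    k = min(wins, losses)
    # Two-sided tail probability under Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-question scores for one metric: SFT wins 9 of 10 pairs.
base = [0.41, 0.52, 0.48, 0.60, 0.39, 0.55, 0.47, 0.50, 0.44, 0.58]
sft  = [0.50, 0.58, 0.53, 0.66, 0.45, 0.54, 0.52, 0.57, 0.49, 0.63]
print(sign_test_p(base, sft) < 0.05)  # True: reject NH at the 5% level
```

Repeating such a test for each of the 14 metrics yields rejection counts of the kind reported (e.g., 13 out of 14 for Vanilla KD).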