Knowledge Distillation of Domain-adapted LLMs for Question-Answering in Telecom

Sen, Rishika, Roychowdhury, Sujoy, Soman, Sumit, Ranjani, H. G., Mohanty, Srikhetra

arXiv.org Artificial Intelligence 

Figure 1 shows the heatmap depicting the performance of 16 combinations of KD for 14 metrics. For brevity, we also report the mean of all 14 metrics and the group-wise means (N-gram metrics, embedding-based metrics, and Oracle-LLM metrics) in Figure 1. We systematically analyze the results and organize our findings as the impact of (i) SFT (RQ1), (ii) SFT on teacher and student (RQ1), (iii) vocabulary and KD algorithm (RQ2), and (iv) performance metric groups (RQ3).

3.1 Impact of SFT

We organize the analysis with vocabulary as the starting point.

3.1.1 Llama

Consider the bar plots in Figure 1 that depict Llama as the teacher, i.e., the bars denoting (Llama, Vanilla KD) and (Llama, DSKD). We observe that SFT of the teacher, the student, or both improves performance irrespective of the training algorithm (first bar vs. the subsequent 3 bars). The improvement is statistically significant (refer to H^S_train, H^T_train, and H^{T,S}_train in Table 3). Here, we observe that the null hypothesis (NH) is rejected for most metrics (13 out of 14 for Vanilla KD and 8 or 9 out of 14 for DSKD) with SFT of the student, the teacher, or both for the Llama vocabulary, irrespective of the algorithm.
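For readers unfamiliar with the Vanilla KD baseline compared above, the core objective is a KL divergence between the teacher's and student's temperature-softened output distributions. The sketch below is a minimal, self-contained illustration of that loss; the logits are hypothetical and do not come from the paper's actual teacher/student models.

```python
# Sketch of the vanilla KD objective: the student matches the teacher's
# softened distribution via KL divergence. Logits here are hypothetical.
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def vanilla_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the standard (Hinton et al.) formulation."""
    p = softmax(teacher_logits, temperature)  # teacher (target)
    q = softmax(student_logits, temperature)  # student
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Identical logits give zero loss; diverging logits give a positive loss.
print(vanilla_kd_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))      # 0.0
print(vanilla_kd_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]) > 0)  # True
```

In training, this loss is computed per output token position over the vocabulary and averaged; DSKD differs in how it aligns the two models' (possibly mismatched) vocabulary spaces before comparing distributions.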

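The per-metric rejection counts above come from hypothesis tests comparing runs with and without SFT. The exact test statistic is not restated in this excerpt, so the following is only an illustrative sketch using a simple two-sided paired sign test on hypothetical per-question metric scores.

```python
# Hedged illustration of per-metric significance testing: a two-sided
# paired sign test (ties dropped). The paper's actual test is not
# restated in this excerpt; the scores below are hypothetical.
from math import comb

def sign_test_p(baseline, treatment):
    """Two-sided sign test p-value for paired samples."""
    wins = sum(t > b for b, t in zip(baseline, treatment))
    losses = sum(t < b for b, t in zip(baseline, treatment))
    n = wins + losses          # ties contribute nothing
    k = min(wins, losses)
    # Two-sided tail probability under Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-question scores for one metric: SFT wins 9 of 10 pairs.
base = [0.41, 0.52, 0.48, 0.60, 0.39, 0.55, 0.47, 0.50, 0.44, 0.58]
sft  = [0.50, 0.58, 0.53, 0.66, 0.45, 0.54, 0.52, 0.57, 0.49, 0.63]
print(sign_test_p(base, sft) < 0.05)  # True: reject NH at the 5% level
```

Repeating such a test for each of the 14 metrics yields rejection counts of the kind reported (e.g., 13 out of 14 for Vanilla KD).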