A correlation-permutation approach for speech-music encoders model merging

Ritter-Gutierrez, Fabian; Lin, Yi-Cheng; Wong, Jeremy H. M.; Lee, Hung-yi; Chng, Eng Siong; Chen, Nancy F.

arXiv.org Artificial Intelligence 

Simply permuting the shallower layers, or attempting to permute all transformer components indiscriminately, leads to suboptimal outcomes.

C. Layer-wise Permutation Analysis

To understand how structural alignment varies across the depth of the models, we examined the percentage of channels permuted in each MERT layer considered for permutation under the CNN + "fnn+atnn" setup. The results are shown in Table II. Interestingly, most layers are completely permuted; the exception is the first CNN layer of MERT, where only 30.86% of channels were reordered. This suggests that the initial feature representations learned by MERT and HuBERT at this shallow depth are considerably similar. It is plausible that the first layer in both models learns to extract fundamental, low-level acoustic features, akin to filterbank-like representations. The internal channel ordering for these basic features might therefore already be substantially aligned between the two independently trained models, so fewer permutations are needed.
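The per-layer statistic discussed above can be illustrated with a minimal sketch. It assumes that a layer's alignment is stored as a list `perm`, where `perm[i]` is the index of the channel moved into position `i`, and that a channel counts as "permuted" when `perm[i] != i`. The function name `percent_permuted`, the layer names, and the toy permutations are hypothetical and are not taken from the paper.

```python
def percent_permuted(perm):
    """Return the percentage of channel positions whose index changed.

    `perm` is assumed to be a permutation of 0..n-1, where perm[i] is the
    source channel placed at position i (an assumption for illustration).
    """
    if sorted(perm) != list(range(len(perm))):
        raise ValueError("perm must be a permutation of 0..n-1")
    if not perm:
        return 0.0
    # A channel is counted as permuted when it does not stay in place.
    moved = sum(1 for i, p in enumerate(perm) if p != i)
    return 100.0 * moved / len(perm)


# Hypothetical toy permutations for two layers (not data from the paper).
layer_perms = {
    "cnn_0": [0, 1, 3, 2, 4, 5, 6, 7],  # only channels 2 and 3 are swapped
    "cnn_1": [1, 2, 3, 4, 5, 6, 7, 0],  # cyclic shift, every channel moves
}

report = {name: percent_permuted(p) for name, p in layer_perms.items()}
for name, pct in report.items():
    print(f"{name}: {pct:.2f}% of channels permuted")
```

Under this definition, a layer whose channels are already aligned between the two models yields a low percentage, which is the pattern the text reports for the first CNN layer of MERT.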
