A correlation-permutation approach for speech-music encoders model merging

Ritter-Gutierrez, Fabian, Lin, Yi-Cheng, Wong, Jeremy H. M., Lee, Hung-yi, Chng, Eng Siong, Chen, Nancy F.

Jun-16-2025–arXiv.org Artificial Intelligence

Simply permuting shallower layers or attempting to permute all transformer components indiscriminately leads to suboptimal outcomes.

C. Layer-wise Permutation Analysis

To understand how structural alignment varies across the depth of the models, we examined the percentage of channels permuted in each MERT layer considered for permutation when using the CNN + "fnn+atnn" setup. Results are shown in Table II. Most layers are completely permuted; the exception is the first CNN layer of MERT, where only 30.86% of channels were reordered. This suggests that the initial feature representations learned by MERT and HuBERT at this shallow depth are considerably similar. It is plausible that the first layer in both models learns to extract fundamental, low-level acoustic features, akin to filterbank-like representations. The internal channel ordering for these basic features might therefore already be substantially aligned between the two independently trained models, necessitating fewer permutations.
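The per-layer statistic discussed above can be illustrated with a minimal sketch. It assumes the permutation for a layer is chosen to maximize the total Pearson correlation between matched channels of the two models, and it uses an exhaustive search that is only feasible for a handful of channels; the excerpt does not specify the solver the authors use, and the function names `pearson`, `best_permutation`, and `percent_permuted` are hypothetical.

```python
from itertools import permutations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length activation sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def best_permutation(acts_a, acts_b):
    """Return the tuple p where channel i of model A is matched to channel p[i] of model B.

    acts_a[i] and acts_b[j] are per-channel activations recorded on the same inputs.
    Assumption: brute force over all orderings stands in for a proper
    assignment solver, so this only works for very small layers.
    """
    n = len(acts_a)
    corr = [[pearson(acts_a[i], acts_b[j]) for j in range(n)] for i in range(n)]
    return max(
        permutations(range(n)),
        key=lambda p: sum(corr[i][p[i]] for i in range(n)),
    )


def percent_permuted(perm):
    """Percentage of channels whose index changes under the permutation."""
    moved = sum(1 for i, j in enumerate(perm) if i != j)
    return 100.0 * moved / len(perm)


# Toy layer with three channels: model B holds the same features as model A,
# but with channels 1 and 2 stored in swapped order.
acts_a = [[1, 2, 3, 4], [4, 1, 3, 2], [2, 2, 5, 1]]
acts_b = [acts_a[0], acts_a[2], acts_a[1]]

perm = best_permutation(acts_a, acts_b)
print(perm, round(percent_permuted(perm), 2))
```

Under this reading, a layer where the best match is already the identity ordering would report 0%, and a low value such as the 30.86% reported for the first CNN layer of MERT would mean most channels already line up with their best-correlated counterpart.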



