Merging Models with Fisher-Weighted Averaging

Neural Information Processing Systems 

Averaging the parameters of models that have the same architecture and initialization can provide a means of combining their respective capabilities. In this paper, we take the perspective that this merging operation can be seen as choosing parameters that approximately maximize the joint likelihood of the posteriors of the models' parameters.
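
To make the merging operation concrete, the following is a minimal sketch rather than the paper's implementation. It assumes each model's parameters are flattened into a NumPy vector and that a diagonal Fisher information estimate is already available for each model. The function names `average_merge` and `fisher_merge`, the `eps` stabilizer, and the toy values are illustrative assumptions. Under a Gaussian approximation to each posterior, uniform averaging corresponds to giving every parameter the same precision, while Fisher weighting uses each model's per-parameter Fisher value as that precision.

```python
import numpy as np


def average_merge(params):
    """Uniform parameter average: every model gets equal weight per parameter."""
    return np.mean(np.stack(params), axis=0)


def fisher_merge(params, fishers, eps=1e-8):
    """Per-parameter weighted average using diagonal Fisher estimates as weights.

    params  : list of 1-D arrays, one flattened parameter vector per model
    fishers : list of 1-D arrays of non-negative diagonal Fisher values
    eps     : hypothetical stabilizer for parameters whose Fisher values are all ~0
    """
    params = np.stack(params)
    fishers = np.stack(fishers)
    return (fishers * params).sum(axis=0) / (fishers.sum(axis=0) + eps)


# Toy values (made up for illustration): two models, three parameters.
theta_a = np.array([1.0, 2.0, 3.0])
theta_b = np.array([3.0, 2.0, -1.0])
fisher_a = np.array([1.0, 1.0, 3.0])  # model A is more certain about the last parameter
fisher_b = np.array([1.0, 1.0, 1.0])

merged_avg = average_merge([theta_a, theta_b])
merged_fisher = fisher_merge([theta_a, theta_b], [fisher_a, fisher_b])
```

When all Fisher values are equal, `fisher_merge` reduces to `average_merge` (up to `eps`). Where they differ, as in the last coordinate here, the merged value moves toward the model with the larger Fisher value.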